SystemsGenetics / GEMmaker

A workflow for construction of Gene Expression count Matrices (GEMs). Useful for Differential Gene Expression (DGE) analysis and Gene Co-Expression Network (GCN) construction
https://gemmaker.readthedocs.io/en/latest/
MIT License
33 stars 16 forks source link

Fixes for multiple issues with SRAs #204

Closed spficklin closed 3 years ago

spficklin commented 3 years ago

I apologize for the size of this PR. I didn't mean for it to get so big, but I just kept fixing issues trying to get the Arabidopsis 26k to run with GEMmaker.

Issue #193 (cleanup of files) Issue #173 (missing data on NCBI) And other issues I found

Description

I've been having trouble running with the main.nf 17K arabidopsis samples (26K runs) on Kamiak and though many days of restarting the workflow I've resolved several bugs and problems.

Here are examples of problems

Problem 1: During the sra2fastq process.

The SRA is downloaded just fine from NCBI but is corrupted

Command error:
  2021-03-14T19:12:50 fastq-dump.2.10.0 err: data corrupt while executing function within virtual database module - failed SRR3461660.sra

Solution: see the fix described for problem 4 below.

Problem 2: Too many open files

This problem seemed to be caused by file sockets not being closed.

Caused by:
  /work/7b/33bcc82fa8f368c501bd1a71f507ee/.exitcode: Too many open files
[32/b7392f] NOTE: Process `next_sample (SRX1734988)` failed -- Execution is retried (3)

Solution: I think this was caused by not setting the lock file to NULL in the groovy code of the next_sample process. I'm not 100% sure, but so far after doing this I have not encountered the problem again.

Problem 3: Files Not Being Cleaned up.

Both in the sra2fastq.py script and the retreive_sra.py script there was a cleanup step if a job failed and the cleanup was not happening. So, files could build up if there was continuous failures.

Solution: See the fix for problem 4 below.

Problem 4: Failed Runs Stop Everything

If the sra2fastq.py or retrieve_sra.py scripts failed then the entire workflow stops. With 26K samples that becomes too time consuming to keep up with whenever a sample is missing from NCBI or is corrupted through no fault of GEMmaker.

Solution: Neither the sra2fastq.py or the retrieve_sra.py scripts fail if a sample fails. Instead both scripts now do the following.

  1. They write a new file that contains JSON that lists the SRR or SRA name that failed and the reason it failed.
  2. If a run or sample fail then they touch a file named sample_failed indicating to GEMmaker that this sample should not continue.
  3. Both scripts cleanup any files that would normally trigger GEMmaker to move to the next step (e.g. fastq files, sra files).
  4. The processes for both these two steps now watch for the sample_failed file. If present then it automatically triggers the next_sample process and does not proceed with the next steps.
  5. Finally, there is a new process named failed_run_report. It checks all of the files described in item 1 above and generate an HTML report that lists all of the failed runs and the reason for it. The following is a screenshot of the report.
    image

Problem 5: Bug in retrieve_sra.py

The retrieve_sra.py had a bug in it that prevented the https:// protocol from being used as a second attempt if Aspera failed.

Solution: It was a problem in checking the return code of Aspera and the code was fixed.

Notes:

Testing?

For Problems 1, 3 and 4:

You can test these fixes by adding the following to your SRA_IDs file and just running GEMmaker with the local defaults:

# Corrupted SRA after download
SRR3461660
# Metadata is not available
SRR2927686

For problems 2:

I don't know how to offer a way to test this. We'll just have to wait and see if it happens again.

For problem 5:

I don't have a way to test this either. Just take a look at the code in the retrieve_sra.py file where the code switches to the https protocol if Aspera failed and make sure it makes sense to you.

JohnHadish commented 3 years ago

Tested on Kamiak using the Slurm Scheduler. Found the following Issues:

1) Unexpected behavior in failed_runs.metadata.txt

There is unexpected behavior in the failed_runs.metadata.txt file, it appears to be reporting that other files with the same SRX number were not imported, but does not report on the actual "Corrupted" run (SRR3461660).

The failed_runs.metadata.txt file:

{
    "SRR3461661": "Notice: the run, SRR3461661, is part of experiment SRX1734704 but was not included in the input SRA_IDS.txt file. Do you want to include this run?",
    "SRR3461659": "Notice: the run, SRR3461659, is part of experiment SRX1734704 but was not included in the input SRA_IDS.txt file. Do you want to include this run?",
    "SRR3461658": "Notice: the run, SRR3461658, is part of experiment SRX1734704 but was not included in the input SRA_IDS.txt file. Do you want to include this run?",
    "SRR3461657": "Notice: the run, SRR3461657, is part of experiment SRX1734704 but was not included in the input SRA_IDS.txt file. Do you want to include this run?",
    "SRR3461656": "Notice: the run, SRR3461656, is part of experiment SRX1734704 but was not included in the input SRA_IDS.txt file. Do you want to include this run?",
    "SRR3461670": "Notice: the run, SRR3461670, is part of experiment SRX1734704 but was not included in the input SRA_IDS.txt file. Do you want to include this run?",
    "SRR3461669": "Notice: the run, SRR3461669, is part of experiment SRX1734704 but was not included in the input SRA_IDS.txt file. Do you want to include this run?",
    "SRR3461668": "Notice: the run, SRR3461668, is part of experiment SRX1734704 but was not included in the input SRA_IDS.txt file. Do you want to include this run?",
    "SRR3461667": "Notice: the run, SRR3461667, is part of experiment SRX1734704 but was not included in the input SRA_IDS.txt file. Do you want to include this run?",
    "SRR3461666": "Notice: the run, SRR3461666, is part of experiment SRX1734704 but was not included in the input SRA_IDS.txt file. Do you want to include this run?",
    "SRR3461665": "Notice: the run, SRR3461665, is part of experiment SRX1734704 but was not included in the input SRA_IDS.txt file. Do you want to include this run?",
    "SRR3461664": "Notice: the run, SRR3461664, is part of experiment SRX1734704 but was not included in the input SRA_IDS.txt file. Do you want to include this run?",
    "SRR3461663": "Notice: the run, SRR3461663, is part of experiment SRX1734704 but was not included in the input SRA_IDS.txt file. Do you want to include this run?",
    "SRR3461662": "Notice: the run, SRR3461662, is part of experiment SRX1734704 but was not included in the input SRA_IDS.txt file. Do you want to include this run?",
    "SRR2927686": "Metadata was not returned by NCBI for this run."
}

2) Directory with Actual Run from SRA (SRR649944 from SRX218012) has failed run files which are almost empty

Here is the directory:

total 2136
-rw-rw-r-- 2 john.hadish its_p_sys_ur_kam-ficklin 671859 Mar 25 12:10 SRX218012_1_fastqc.html
-rw-rw-r-- 2 john.hadish its_p_sys_ur_kam-ficklin 402113 Mar 25 12:10 SRX218012_1_fastqc.zip
-rw-rw-r-- 2 john.hadish its_p_sys_ur_kam-ficklin 679856 Mar 25 12:10 SRX218012_2_fastqc.html
-rw-rw-r-- 2 john.hadish its_p_sys_ur_kam-ficklin 403427 Mar 25 12:10 SRX218012_2_fastqc.zip
-rw-rw-r-- 2 john.hadish its_p_sys_ur_kam-ficklin      2 Mar 25 12:08 SRX218012.failed_runs.download.txt
-rw-rw-r-- 2 john.hadish its_p_sys_ur_kam-ficklin      2 Mar 25 12:08 SRX218012.failed_runs.fastq-dump.txt
-rw-rw-r-- 2 john.hadish its_p_sys_ur_kam-ficklin    592 Mar 25 12:10 SRX218012.kallisto.log
-rw-rw-r-- 2 john.hadish its_p_sys_ur_kam-ficklin     24 Mar 25 12:11 SRX218012_vs_CORG.Kallisto.raw
-rw-rw-r-- 2 john.hadish its_p_sys_ur_kam-ficklin     39 Mar 25 12:11 SRX218012_vs_CORG.Kallisto.tpm

Everything appears to have ran fine, but there are two unexpected files (SRX218012.failed_runs.download.txt and SRX218012.failed_runs.fastq-dump.txt ) that only contain a {}

The SRX1734704/SRX1734704.failed_runs.fastq-dump.txt file:

{
    "SRR3461660": "2021-03-25T19:10:37 fastq-dump.2.10.0 err: data corrupt while executing function within virtual database module - failed SRR3461660.sra\n\n=============================================================\nAn error occurred during processing.\nA report was generated into the file '/home/john.hadish/ncbi_error_report.txt'.\nIf the problem persists, you may consider sending the file\nto 'sra-tools@ncbi.nlm.nih.gov' for assistance.\n=============================================================\n\n"

The ``

ll
total 4
-rw-rw-r-- 2 john.hadish its_p_sys_ur_kam-ficklin   2 Mar 25 12:09 SRX1734704.failed_runs.download.txt
-rw-rw-r-- 2 john.hadish its_p_sys_ur_kam-ficklin 519 Mar 25 12:10 SRX1734704.failed_runs.fastq-dump.txt

3 Output Directory Created for SRX1734704 (containing SRR3461660) even though it failed. Has 2 files:

The directory looks like this:

-rw-rw-r-- 2 john.hadish its_p_sys_ur_kam-ficklin   2 Mar 25 12:09 SRX1734704.failed_runs.download.txt
-rw-rw-r-- 2 john.hadish its_p_sys_ur_kam-ficklin 519 Mar 25 12:10 SRX1734704.failed_runs.fastq-dump.txt

with the first file only containing a {} and the second file containing:

{
    "SRR3461660": "2021-03-25T19:10:37 fastq-dump.2.10.0 err: data corrupt while executing function within virtual database module - failed SRR3461660.sra\n\n=============================================================\nAn error occurred during processing.\nA report was generated into the file '/home/john.hadish/ncbi_error_report.txt'.\nIf the problem persists, you may consider sending the file\nto 'sra-tools@ncbi.nlm.nih.gov' for assistance.\n=============================================================\n\n"
}

You will remember from issue 1 that this run (SRR3461660) was not reported in the failed_runs.metadata.txt file

4 There is no failed_runs_report.py

I think you forgot to push it probably. Process fails because there is no file for it.

spficklin commented 3 years ago

Thanks for the review @JohnHadish . My responses are in order of your comments

  1. For item 1. This isn't unexpected behavior, it is intentional. But I forgot to mention it in my description above. It warns the user that they are excluding runs that are part of a larger sample. It shouldn't cause the sample to be skipped, because GEMmaker will still process what was downloaded. It just lets the user know so they can fix the problem if they want to. edit: oh and also, the reason it doesn't report on the corrupted file is because it doesn't find out that the file is corrupted until after it is downloaded and it tries to dump it to fastq. There are three places where there could be issues with runs:
    • retrieving SRAs (i.e. no metadata) in the retrieve_sra_metadata process
    • downloading SRAs in the download_run process
    • dumping SRAs to fastq in the fastq-dump process
  2. The two "unexpected files" in the output/SRX218012 directory are actually empty indicating that no failed runs occurred. One is from the download_runs process and the other is from the fastq-dump process. For samples with failed runs then those files will have entries in them respectively. These are there because we want to publish these to samples when a run fails in case the end-user wants to see them. I guess technically if they are empty it might seem confusing for them to be there but it easier to just them them be there empty then to figure out how to only publish non-empty files.
  3. So the reason there are files in the output directory is for the same reason as 2 above. With this PR GEMmaker publishes the two files that list failed runs from the download and dump processes, so every sample even if the fail or succeed gets a set. The first file is empty because no runs failed on download. The second file has an entry because the file is corrupted.
  4. Did you look in the reports folder for that failed reports file? If there was a missing file GEMmaker would fail. IF you got it to complete then it should be there. Check in the files folder of GEMmaker and there should be a template file for the report. That would be the only file the report requires.
JohnHadish commented 3 years ago

Here are some clarifications about my concerns that were not adressed in your response.

For Ease of reading, I will be refering to the two test files you supplied as

SRR_Corrupted (SRR3461660)

SRR_NoMetaData (SRR2927686).

These belong to SRX projects

SRX_CORRUPTED

SRX_NOMETADATA

respectively. I have used capitalization to make the difference beteen the SRR abd SRX clear.

1 SRR_Corrupted is not mentioned in the file failed_runs.metadata.txt

SRR_Corrupted is not mentioned in the failed_runs.metadata.txt.

All files associated with SRX_CORRUPTED are mentioned besides SRR_Corrupted

In contrast, SRR_NoMetaData is mentioned, but none of the other files associated with SRX_NOMETADATA are mentioned in this file

2 and 3 There is only a directory for SRX_CORRUPTED

For consistency of output I expect either a directory for all of the samples that fail, or no directories for these samples. There is currently only a directory for one of them.

4 There is no _script named failed_runsreport.py in the bin directory

The script is missing. I see it mentioned, but it is not on GitHub

spficklin commented 3 years ago

Update... The conversation was getting a bit long, so @JohnHadish and I spoke by Zoom and made sense of everything above. Except, I had forgotten to commit the script for generating the final report, which has now been done. So, @JohnHadish is going to retest.

JohnHadish commented 3 years ago

The failed runs report is now created: image

For some reason the process "MultiQC" hung for over an hour without producing a report. I am rerunning the test just to make sure this is not related

JohnHadish commented 3 years ago

This pull request appears to work as described, except that it breaks the following features:

I am not sure what the proper way to fix -resume. Suspect it has to do with the ghost file system where things are zeroed out to save space.

I suspected that MULTIQC failing is caused by the new directory SRA_META. I tried adding an additonal ignore to the main.nf script, but it still fails to complete.

It would also be nice to have some new documentation describing how GEMmaker deals with these files, and the diffences between the possible errors. I like the html report, but to someone that does not know what they are, it would be kind of confusing.

spficklin commented 3 years ago

Thanks @JohnHadish

The multiqc report generates on my end. I have a timestamp on the report that is the same time as the new failed runs report. So, it's working for me.

The -resume flag is working on my current run on Kamiak. I'm currently running this version and I can see that it has skipped some samples because they are cached. I'll look a bit more at this and see why it looks like it might not be working.

And FYI.... the SRA_META folder is not new to this PR. It was there before. It was added in PR #192

spficklin commented 3 years ago

@JohnHadish I just confirmed the -resume does work. Below is my output after running this branch a second time, and it ran very fast because all was already done.

[b6/4457f0] process > retrieve_sra_metadata (1)                [100%] 1 of 1 ✔
[ee/3f19e1] process > write_stage_files (SRX1734704)           [100%] 5 of 5 ✔
[66/6fbbfd] process > start_first_batch                        [100%] 1 of 1 ✔
[6c/0f632e] process > read_sample_file (SRX218012.sample.csv)  [100%] 5 of 5 ✔
[68/aad37d] process > next_sample (1)                          [100%] 5 of 5 ✔
[3d/41dc96] process > download_runs (SRX218012)                [100%] 2 of 2, cached: 2 ✔
[40/ed8706] process > fastq_dump (SRX218012)                   [100%] 2 of 2, cached: 2 ✔
[d7/ccc275] process > fastq_merge (SRX218012)                  [100%] 1 of 1, cached: 1 ✔
[7a/beacbb] process > fastqc_1 (SRX218012)                     [100%] 4 of 4, cached: 4 ✔
[e8/98ca04] process > kallisto (SRX218012)                     [100%] 4 of 4, cached: 4 ✔
[14/aee624] process > kallisto_tpm (SRX218012)                 [100%] 4 of 4, cached: 4 ✔
[-        ] process > salmon                                   -
[-        ] process > salmon_tpm                               -
[-        ] process > trimmomatic                              -
[-        ] process > fastqc_2                                 -
[-        ] process > hisat2                                   -
[-        ] process > samtools_sort                            -
[-        ] process > samtools_index                           -
[-        ] process > stringtie                                -
[-        ] process > hisat2_fpkm_tpm                          -
[62/42660e] process > multiqc                                  [100%] 1 of 1 ✔
[4e/c1b9bf] process > create_gem                               [100%] 1 of 1 ✔
[a8/f200eb] process > failed_run_report (1)                    [100%] 1 of 1 ✔
[bc/5bf2f7] process > clean_sra (SRX218012)                    [100%] 2 of 2, cached: 2 ✔
[f8/d06b7c] process > clean_fastq (SRX218012)                  [100%] 1 of 1, cached: 1 ✔
[06/02f3ee] process > clean_merged_fastq (SRX218012)           [100%] 1 of 1, cached: 1 ✔
[-        ] process > clean_trimmed_fastq                      -
[-        ] process > clean_sam                                -
[-        ] process > clean_bam                                -
[d8/3703f0] process > clean_kallisto_ga (SRX218012)            [100%] 4 of 4, cached: 4 ✔
[-        ] process > clean_salmon_ga                          -
[-        ] process > clean_stringtie_ga                       -
JohnHadish commented 3 years ago

Tested more, I am still not sure why it was failing on Kamiak backfill nodes, but it appears to run if I use our nodes. Was not receiving any error that I had been booted off of a node.

spficklin commented 3 years ago

Oh good point! I'll add that information to the documentation so it doesn't get forgotten, then I'll merge the PR. Thanks @JohnHadish

spficklin commented 3 years ago

Documentation added, merging now.