Closed spficklin closed 3 years ago
Tested on Kamiak using the Slurm Scheduler. Found the following Issues:
There is unexpected behavior in the failed_runs.metadata.txt
file, it appears to be reporting that other files with the same SRX number were not imported, but does not report on the actual "Corrupted" run (SRR3461660).
The failed_runs.metadata.txt
file:
{
"SRR3461661": "Notice: the run, SRR3461661, is part of experiment SRX1734704 but was not included in the input SRA_IDS.txt file. Do you want to include this run?",
"SRR3461659": "Notice: the run, SRR3461659, is part of experiment SRX1734704 but was not included in the input SRA_IDS.txt file. Do you want to include this run?",
"SRR3461658": "Notice: the run, SRR3461658, is part of experiment SRX1734704 but was not included in the input SRA_IDS.txt file. Do you want to include this run?",
"SRR3461657": "Notice: the run, SRR3461657, is part of experiment SRX1734704 but was not included in the input SRA_IDS.txt file. Do you want to include this run?",
"SRR3461656": "Notice: the run, SRR3461656, is part of experiment SRX1734704 but was not included in the input SRA_IDS.txt file. Do you want to include this run?",
"SRR3461670": "Notice: the run, SRR3461670, is part of experiment SRX1734704 but was not included in the input SRA_IDS.txt file. Do you want to include this run?",
"SRR3461669": "Notice: the run, SRR3461669, is part of experiment SRX1734704 but was not included in the input SRA_IDS.txt file. Do you want to include this run?",
"SRR3461668": "Notice: the run, SRR3461668, is part of experiment SRX1734704 but was not included in the input SRA_IDS.txt file. Do you want to include this run?",
"SRR3461667": "Notice: the run, SRR3461667, is part of experiment SRX1734704 but was not included in the input SRA_IDS.txt file. Do you want to include this run?",
"SRR3461666": "Notice: the run, SRR3461666, is part of experiment SRX1734704 but was not included in the input SRA_IDS.txt file. Do you want to include this run?",
"SRR3461665": "Notice: the run, SRR3461665, is part of experiment SRX1734704 but was not included in the input SRA_IDS.txt file. Do you want to include this run?",
"SRR3461664": "Notice: the run, SRR3461664, is part of experiment SRX1734704 but was not included in the input SRA_IDS.txt file. Do you want to include this run?",
"SRR3461663": "Notice: the run, SRR3461663, is part of experiment SRX1734704 but was not included in the input SRA_IDS.txt file. Do you want to include this run?",
"SRR3461662": "Notice: the run, SRR3461662, is part of experiment SRX1734704 but was not included in the input SRA_IDS.txt file. Do you want to include this run?",
"SRR2927686": "Metadata was not returned by NCBI for this run."
}
Here is the directory:
total 2136
-rw-rw-r-- 2 john.hadish its_p_sys_ur_kam-ficklin 671859 Mar 25 12:10 SRX218012_1_fastqc.html
-rw-rw-r-- 2 john.hadish its_p_sys_ur_kam-ficklin 402113 Mar 25 12:10 SRX218012_1_fastqc.zip
-rw-rw-r-- 2 john.hadish its_p_sys_ur_kam-ficklin 679856 Mar 25 12:10 SRX218012_2_fastqc.html
-rw-rw-r-- 2 john.hadish its_p_sys_ur_kam-ficklin 403427 Mar 25 12:10 SRX218012_2_fastqc.zip
-rw-rw-r-- 2 john.hadish its_p_sys_ur_kam-ficklin 2 Mar 25 12:08 SRX218012.failed_runs.download.txt
-rw-rw-r-- 2 john.hadish its_p_sys_ur_kam-ficklin 2 Mar 25 12:08 SRX218012.failed_runs.fastq-dump.txt
-rw-rw-r-- 2 john.hadish its_p_sys_ur_kam-ficklin 592 Mar 25 12:10 SRX218012.kallisto.log
-rw-rw-r-- 2 john.hadish its_p_sys_ur_kam-ficklin 24 Mar 25 12:11 SRX218012_vs_CORG.Kallisto.raw
-rw-rw-r-- 2 john.hadish its_p_sys_ur_kam-ficklin 39 Mar 25 12:11 SRX218012_vs_CORG.Kallisto.tpm
Everything appears to have ran fine, but there are two unexpected files (SRX218012.failed_runs.download.txt
and SRX218012.failed_runs.fastq-dump.txt
) that only contain a {}
The SRX1734704/SRX1734704.failed_runs.fastq-dump.txt
file:
{
"SRR3461660": "2021-03-25T19:10:37 fastq-dump.2.10.0 err: data corrupt while executing function within virtual database module - failed SRR3461660.sra\n\n=============================================================\nAn error occurred during processing.\nA report was generated into the file '/home/john.hadish/ncbi_error_report.txt'.\nIf the problem persists, you may consider sending the file\nto 'sra-tools@ncbi.nlm.nih.gov' for assistance.\n=============================================================\n\n"
The ``
ll
total 4
-rw-rw-r-- 2 john.hadish its_p_sys_ur_kam-ficklin 2 Mar 25 12:09 SRX1734704.failed_runs.download.txt
-rw-rw-r-- 2 john.hadish its_p_sys_ur_kam-ficklin 519 Mar 25 12:10 SRX1734704.failed_runs.fastq-dump.txt
The directory looks like this:
-rw-rw-r-- 2 john.hadish its_p_sys_ur_kam-ficklin 2 Mar 25 12:09 SRX1734704.failed_runs.download.txt
-rw-rw-r-- 2 john.hadish its_p_sys_ur_kam-ficklin 519 Mar 25 12:10 SRX1734704.failed_runs.fastq-dump.txt
with the first file only containing a {}
and the second file containing:
{
"SRR3461660": "2021-03-25T19:10:37 fastq-dump.2.10.0 err: data corrupt while executing function within virtual database module - failed SRR3461660.sra\n\n=============================================================\nAn error occurred during processing.\nA report was generated into the file '/home/john.hadish/ncbi_error_report.txt'.\nIf the problem persists, you may consider sending the file\nto 'sra-tools@ncbi.nlm.nih.gov' for assistance.\n=============================================================\n\n"
}
You will remember from issue 1 that this run (SRR3461660) was not reported in the failed_runs.metadata.txt
file
I think you forgot to push it probably. Process fails because there is no file for it.
Thanks for the review @JohnHadish . My responses are in order of your comments
download_run process
fastq-dump process
output/SRX218012
directory are actually empty indicating that no failed runs occurred. One is from the download_runs
process and the other is from the fastq-dump
process. For samples with failed runs then those files will have entries in them respectively. These are there because we want to publish these to samples when a run fails in case the end-user wants to see them. I guess technically if they are empty it might seem confusing for them to be there but it easier to just them them be there empty then to figure out how to only publish non-empty files.files
folder of GEMmaker and there should be a template file for the report. That would be the only file the report requires.Here are some clarifications about my concerns that were not adressed in your response.
For Ease of reading, I will be refering to the two test files you supplied as
SRR_Corrupted
(SRR3461660)
SRR_NoMetaData
(SRR2927686).
These belong to SRX projects
SRX_CORRUPTED
SRX_NOMETADATA
respectively. I have used capitalization to make the difference beteen the SRR abd SRX clear.
SRR_Corrupted is not mentioned in the failed_runs.metadata.txt
.
All files associated with SRX_CORRUPTED are mentioned besides SRR_Corrupted
In contrast, SRR_NoMetaData is mentioned, but none of the other files associated with SRX_NOMETADATA are mentioned in this file
For consistency of output I expect either a directory for all of the samples that fail, or no directories for these samples. There is currently only a directory for one of them.
The script is missing. I see it mentioned, but it is not on GitHub
Update... The conversation was getting a bit long, so @JohnHadish and I spoke by Zoom and made sense of everything above. Except, I had forgotten to commit the script for generating the final report, which has now been done. So, @JohnHadish is going to retest.
The failed runs report is now created:
For some reason the process "MultiQC" hung for over an hour without producing a report. I am rerunning the test just to make sure this is not related
This pull request appears to work as described, except that it breaks the following features:
-resume
tag that can be added to a run command no longer works. GEMmaker appears to try to resume, but then just re-does everythingI am not sure what the proper way to fix -resume
. Suspect it has to do with the ghost file system where things are zeroed out to save space.
I suspected that MULTIQC failing is caused by the new directory SRA_META
. I tried adding an additonal ignore to the main.nf
script, but it still fails to complete.
It would also be nice to have some new documentation describing how GEMmaker deals with these files, and the diffences between the possible errors. I like the html report, but to someone that does not know what they are, it would be kind of confusing.
Thanks @JohnHadish
The multiqc report generates on my end. I have a timestamp on the report that is the same time as the new failed runs report. So, it's working for me.
The -resume flag is working on my current run on Kamiak. I'm currently running this version and I can see that it has skipped some samples because they are cached. I'll look a bit more at this and see why it looks like it might not be working.
And FYI.... the SRA_META folder is not new to this PR. It was there before. It was added in PR #192
@JohnHadish I just confirmed the -resume does work. Below is my output after running this branch a second time, and it ran very fast because all was already done.
[b6/4457f0] process > retrieve_sra_metadata (1) [100%] 1 of 1 ✔
[ee/3f19e1] process > write_stage_files (SRX1734704) [100%] 5 of 5 ✔
[66/6fbbfd] process > start_first_batch [100%] 1 of 1 ✔
[6c/0f632e] process > read_sample_file (SRX218012.sample.csv) [100%] 5 of 5 ✔
[68/aad37d] process > next_sample (1) [100%] 5 of 5 ✔
[3d/41dc96] process > download_runs (SRX218012) [100%] 2 of 2, cached: 2 ✔
[40/ed8706] process > fastq_dump (SRX218012) [100%] 2 of 2, cached: 2 ✔
[d7/ccc275] process > fastq_merge (SRX218012) [100%] 1 of 1, cached: 1 ✔
[7a/beacbb] process > fastqc_1 (SRX218012) [100%] 4 of 4, cached: 4 ✔
[e8/98ca04] process > kallisto (SRX218012) [100%] 4 of 4, cached: 4 ✔
[14/aee624] process > kallisto_tpm (SRX218012) [100%] 4 of 4, cached: 4 ✔
[- ] process > salmon -
[- ] process > salmon_tpm -
[- ] process > trimmomatic -
[- ] process > fastqc_2 -
[- ] process > hisat2 -
[- ] process > samtools_sort -
[- ] process > samtools_index -
[- ] process > stringtie -
[- ] process > hisat2_fpkm_tpm -
[62/42660e] process > multiqc [100%] 1 of 1 ✔
[4e/c1b9bf] process > create_gem [100%] 1 of 1 ✔
[a8/f200eb] process > failed_run_report (1) [100%] 1 of 1 ✔
[bc/5bf2f7] process > clean_sra (SRX218012) [100%] 2 of 2, cached: 2 ✔
[f8/d06b7c] process > clean_fastq (SRX218012) [100%] 1 of 1, cached: 1 ✔
[06/02f3ee] process > clean_merged_fastq (SRX218012) [100%] 1 of 1, cached: 1 ✔
[- ] process > clean_trimmed_fastq -
[- ] process > clean_sam -
[- ] process > clean_bam -
[d8/3703f0] process > clean_kallisto_ga (SRX218012) [100%] 4 of 4, cached: 4 ✔
[- ] process > clean_salmon_ga -
[- ] process > clean_stringtie_ga -
Tested more, I am still not sure why it was failing on Kamiak backfill nodes, but it appears to run if I use our nodes. Was not receiving any error that I had been booted off of a node.
Oh good point! I'll add that information to the documentation so it doesn't get forgotten, then I'll merge the PR. Thanks @JohnHadish
Documentation added, merging now.
I apologize for the size of this PR. I didn't mean for it to get so big, but I just kept fixing issues trying to get the Arabidopsis 26k to run with GEMmaker.
Issue #193 (cleanup of files) Issue #173 (missing data on NCBI) And other issues I found
Description
I've been having trouble running with the
main.nf
17K arabidopsis samples (26K runs) on Kamiak and though many days of restarting the workflow I've resolved several bugs and problems.Here are examples of problems
Problem 1: During the sra2fastq process.
The SRA is downloaded just fine from NCBI but is corrupted
Solution: see the fix described for problem 4 below.
Problem 2: Too many open files
This problem seemed to be caused by file sockets not being closed.
Solution: I think this was caused by not setting the lock file to NULL in the groovy code of the
next_sample
process. I'm not 100% sure, but so far after doing this I have not encountered the problem again.Problem 3: Files Not Being Cleaned up.
Both in the
sra2fastq.py
script and theretreive_sra.py
script there was a cleanup step if a job failed and the cleanup was not happening. So, files could build up if there was continuous failures.Solution: See the fix for problem 4 below.
Problem 4: Failed Runs Stop Everything
If the
sra2fastq.py
orretrieve_sra.py
scripts failed then the entire workflow stops. With 26K samples that becomes too time consuming to keep up with whenever a sample is missing from NCBI or is corrupted through no fault of GEMmaker.Solution: Neither the
sra2fastq.py
or theretrieve_sra.py
scripts fail if a sample fails. Instead both scripts now do the following.sample_failed
indicating to GEMmaker that this sample should not continue.sample_failed
file. If present then it automatically triggers thenext_sample
process and does not proceed with the next steps.failed_run_report
. It checks all of the files described in item 1 above and generate an HTML report that lists all of the failed runs and the reason for it. The following is a screenshot of the report.Problem 5: Bug in
retrieve_sra.py
The
retrieve_sra.py
had a bug in it that prevented the https:// protocol from being used as a second attempt if Aspera failed.Solution: It was a problem in checking the return code of Aspera and the code was fixed.
Notes:
Testing?
For Problems 1, 3 and 4:
You can test these fixes by adding the following to your SRA_IDs file and just running GEMmaker with the local defaults:
For problems 2:
I don't know how to offer a way to test this. We'll just have to wait and see if it happens again.
For problem 5:
I don't have a way to test this either. Just take a look at the code in the
retrieve_sra.py
file where the code switches to the https protocol if Aspera failed and make sure it makes sense to you.