[JHPCE] repeated fastqc failure on trimmed paired data

gpertea commented 1 year ago

Running in /dcs04/lieber/lieber_jhshin/RNAseq_LIBD4020/spqz/p2_n443 - see the SPEAQeasy.log there. I tried resuming this a few times but I couldn't figure out why I keep getting this error in this sample R23587 :

Approx 95% complete for R23587_trimmed_paired_2.fastq
cp: cannot stat 'R23587_unpaired.fastq_trimmed_paired_1_fastqc/summary.txt': No such file or directory
cp: cannot stat 'R23587_unpaired.fastq_trimmed_paired_1_fastqc/fastqc_data.txt': No such file or directory

Work dir:
  /dcs04/lieber/lieber_jhshin/RNAseq_LIBD4020/spqz/p2_n443/wrk/f8/2b16a7d8c0878f212b19f5039e8a47

There are no such R23587_unpaired* entries in that working directory, only paired ones, so I don't get the meaning of the error (I guess it's possible that after trimming the full pairing is maintained, so I hope the script took that possibility into account and should not error out like this).

Nick-Eagles commented 1 year ago

Hey, it's not obvious to me on the surface why this would happen, and I don't have permissions to view anything under /dcs04/lieber/lieber_jhshin/RNAseq_LIBD4020/spqz/p2_n443. I'm not sure the best way to grant permissions here so I can take a look, though I'd likely just need to see SPEAQeasy_output.log, and the working directories for this process (QualityTrimmed) and maybe Trimming for this sample.

gpertea commented 1 year ago

Ah, forgot about the restrictive permissions for that group (sequencing core limitation).

I copied that directory tree (except any temp fastq file for other samples but R23587), in this directory: /dcs04/lieber/lcolladotor/dbDev_LIBD001/spqz_p2_dbg/ I also put the original FASTQ for that sample, just in case, under the fastq_R23587 directory there. Thank you for looking into this.

Nick-Eagles commented 1 year ago

There's a couple things here that are really strange. The process "tag" for QualityTrimmed for that sample is R23587_unpaired.fastq, and I'm fairly certain that the get_prefix function can't return a value like that (_unpaired should get removed). Also, this line should catch bad prefixes like R23587_unpaired.fastq. Is your SPEAQeasy fairly recent? It also doesn't make sense how a file containing R23587_unpaired.fastq anywhere in its name could end up in the channel for QualityTrimmed inputs based on that glob.

Next, I can't think of any possible explanation for how R23587_unpaired.fastq was calculated as the prefix here, given that, as you mentioned, only R23587_trimmed_paired* files exist in the working directory.

It looks like I made note a couple years ago of running into this type of issue, but in that case, a totally different sample ID was used than the files in the working directory! If I recall correctly, I simply resumed the pipeline and the error vanished. Rarely, I do encounter a problem like that, where I suspect a bug in Nextflow itself might be the cause. I suppose that's not a helpful answer, though, if you've tried resuming a few times.

I'm really getting the feeling that this is a Nextflow bug. While I would recommend trying a newer version of Nextflow, I'm not sure resuming will work. Another hack could even be to delete the work directory (/dcs04/lieber/lieber_jhshin/RNAseq_LIBD4020/spqz/p2_n443/wrk/8e/1af402136c02977c3ceef2aecc4bce) for Trimming for that problematic sample, to see if somehow re-running trimming and recalculating the input channel to QualityUntrimmed resets things.

gpertea commented 1 year ago

In terms of SPEAQeasy version, this should be the most recent version from Github/master (I had to use that due to the SLURM changes on JHPCE), deployed recently in /dcs04/lieber/lcolladotor/dbDev_LIBD001/SPEAQeasy ..with only a few (minor, I hope) changes, committed to the branch jhpce_custom. In this branch I slightly adjusted the wrapper script and had a few configuration changes - also tried a couple of submitRateLimit alternatives in conf/slurm.config in an attempt to get rid of that pesky sbatch: error: slurm_set_addr: Unable to resolve "usher05" (which can be seen in .nextflow.2 file there, from a previous attempt).

As seen in run_jhpce_g25m.sh I am loading nextflow/20.01.0, as per the original wrapper script in SPEAQeasy. The only other version of nextflow module on JHPCE3 is 22.10.7, which seems to be broken as noted in your message today on bithelp to the "other Nick" :) where you also suggested to stick to 20.01.0 ..

I guess I could install a new version and restart the whole thing - this would be good to clarify this problem -- maybe we're luckier and even get rid of the other error related to DNS resolution..

I thought the same nextflow 20.01 version worked fine with SPEAQeasy on SGE, maybe this version has some bugs with SLURM (or our SLURM setup) ?!..

gpertea commented 1 year ago

Another even more messed up example showing how in SPEAQeasy/SLURM/nextflow 20.01 now the wrong file names are somehow being picked up by the workflow - like you commented above.

This was for another, very similar run (part 1 of the same batch), that until now it kept giving me the sbatch Unable to resolve "usher05" error, so I kept resuming it with different values of submitRateLimit -- but today it suddenly failed in QualityTrimmed like the other run, but worse:

Take a look in [UPDATE: copied/edit dir due to permissions]: /dcs04/lieber/lcolladotor/dbDev_LIBD001/spqz_p1_dbg/wrk/16/65d14c46767f1efb124fa15ec1a472/ .command.sh there operates first on R11485 but then it tries to copy R11665* result files ! 😱

#!/bin/bash -euo pipefail
fastqc -t 2 --extract  R11485_trimmed_paired_1.fastq R11485_trimmed_paired_2.fastq
cp R11665_trimmed_paired_1_fastqc/summary.txt R11665_1_trimmed_summary.txt && cp R11665_trimmed_paired_2_fastqc/summary.txt R11665_2_trimmed_summary.txt
cp R11665_trimmed_paired_1_fastqc/fastqc_data.txt R11665_1_trimmed_fastqc_data.txt && cp R11665_trimmed_paired_2_fastqc/fastqc_data.txt R11665_2_trimmed_fastqc_data.txt

How is this even possible ?!

Nick-Eagles commented 1 year ago

Yeah, that's extremely strange. While I do think there's a bug with how Nextflow is handling the input channels, it's also true that QualityTrimmed handles the input channel in a way that's different than any other process operating on a single sample in the pipeline. Again, if 99% of samples are working as expected, and the failing sample seems to change each time (though I suppose it's possible that it's just the order of failure that's changing-- which doesn't seem to be true in past similar cases), I don't think the channel logic is flawed.

Regardless, maybe the right move is for me to make the logic match other processes operating on one sample at a time, and never uncover the specific reason why things aren't working.

Yeah, given that the newer nextflow module is broken, and your SPEAQeasy version is very recent and barely modified, reinstallation or trying a different nextflow version probably isn't worth the effort.

I'll experiment with changing the channel logic and get back to you.

Nick-Eagles commented 1 year ago

I expect that above commit to address the issue, though it wasn't really possible for me to replicate the issue for testing. I made the channel handling match every other single-sample process, and as far as I know, no other process has produced the issue we've seen here.

gpertea commented 1 year ago

Thanks - I rebased on master (and checked that the commit d814185 was applied), unfortunately the 3 attempts to resume p1_n442 since that updated resulted in failures like this (different R# everytime, and false - those sampleIDs are actually present in the manifest file..):

Error: deduced the sample ID 'R11042' from the file 'R11042_trimmed_paired_1.fastq', but this ID is not in 'samples.manifest'. This is likely a bug in SPEAQeasy!

I copied SPEAQeasy_output.log, .nextflow.log, .nextflow.log.1 and .nextflow.log.2 into /dcs04/lieber/lcolladotor/dbDev_LIBD001/spqz_p1_dbg/ if it helps. Any other files that could be useful here? This error does not even show a specific working directory.

gpertea commented 1 year ago

Just got the same error in the other batch - copied the updated logs in /dcs04/lieber/lcolladotor/dbDev_LIBD001/spqz_p2_dbg/

That took a while to manifest the same bug, because it first failed with that annoying sbatch: error: get_addr_info: getaddrinfo() failed: Name or service not known error -- even though I tried 1 / 5s etc. submit rates, does not seem to help.

By the way, should we make a separate issue to keep track of that other problematic behavior of SPEAQeasy on SLURM ?

I agree with you that the sbatch error does NOT seem directly a bug in SPEAQeasy code, but rather nextflow/SLURM, it still seems to affect SPEAQeasy preferentially..

Nick-Eagles commented 1 year ago

Given that the last change didn't work, I'm thinking using a more recent version of Nextflow might be worth it now. I can build a module and let you know (it'll also involve a small command-line change too).

Regarding the sbatch issue, I had kind of assumed it only tended to come up with SPEAQeasy because there are few other contexts where such a large number of individual jobs were being submitted. In my opinion it's worth making an issue on SPEAQeasy only if it seems to be coming up disproportionately often with SPEAQeasy relative to the number of jobs being submitted. I haven't run SPEAQeasy recently enough to know, but maybe you have a feeling about that. Otherwise, it strikes me as purely a SLURM configuration issue at JHPCE, that might be possible but probably tough to bandaid with changes to SPEAQeasy settings (at least given that the most intuitive setting to change, submitRateLimit, didn't seem to work).

gpertea commented 1 year ago

An update to this: I installed the latest nextflow that still supports DSL1 (22.10.7) which of course does not work with java 19 on JHPCE3, so I had to also install java 18. Unfortunately, the issue persist in the trimming step, maybe even worse. Now SPEAQeasy also fails on a small example of just 10 samples that I ran in this directory (where you should have full access): /dcs04/lieber/lcolladotor/dbDev_LIBD001/spqz_dbg_n10

That example worked fine on SLURM a couple of weeks ago. However now that seems to fail with an error like this:

  Approx 85% complete for R11429_trimmed_paired_2.fastq
  Approx 90% complete for R11429_trimmed_paired_2.fastq
  Approx 95% complete for R11429_trimmed_paired_2.fastq
  cp: cannot stat 'R11429_trimmed_paired_1_fastqc/summary.txt': No such file or directory
  cp: cannot stat 'R11429_trimmed_paired_1_fastqc/fastqc_data.txt': No such file or directory

Work dir:
  /dcs04/lieber/lcolladotor/dbDev_LIBD001/spqz_dbg_n10/wrk/b9/4b59f7147130eae1019198f35427bb

The .comand.sh there simply lacks the fastqc processing of the _1 mate, it only runs with _2:

#!/bin/bash -euo pipefail
fastqc -t 2 --extract  R11429_trimmed_paired_2.fastq
cp R11429_trimmed_paired_1_fastqc/summary.txt R11429_1_trimmed_summary.txt && cp R11429_trimmed_paired_2_fastqc/summary.txt R11429_2_trimmed_summary.txt
cp R11429_trimmed_paired_1_fastqc/fastqc_data.txt R11429_1_trimmed_fastqc_data.txt && cp R11429_trimmed_paired_2_fastqc/fastqc_data.txt R11429_2_trimmed_fastqc_data.txt

Should the latest input channel logic changes be reverted? I think this problem became even worse after the last patch.

Hopefully this small example can be used for debugging this issue.

Nick-Eagles commented 1 year ago

That's not good. I was actually hoping though to try the latest nextflow, as I believe there's an option to explicitly specify DSL1 at run time for backwards compatibility, even though the default changes to DSL2 after 22.10.7. Thanks for creating the small subset I can work with! I'll play around with nextflow version and possibly channel logic again.

Nick-Eagles commented 12 months ago

As a brief update, I was wrong about Nextflow supporting DSL1 in any capacity after 22.10.7. Also, on my first test of the small example, the full pipeline completed without any issues. I modified (a copy of) main.nf to print the contents of the channel to QualityTrimmed, which also looked exactly as expected. I suppose I'll try slight variations and hope the bug shows up.

Nick-Eagles commented 12 months ago

I'm using a slightly modified version (/dcs04/lieber/lcolladotor/dbDev_LIBD001/spqz_dbg_n10/run_jhpce_g25m_nick.sh) of your wrapper, which points to /dcs04/lieber/lcolladotor/dbDev_LIBD001/SPEAQeasy/main_temp_test.nf, what I believe to be functionally identical to your version of SPEAQeasy where the issue occurred (the difference is one line of code which should simply print the contents of a channel from Trimming to QualityTrimmed). I also used your nextflow/22.10.7 module. I tried 3 runs, all of which completed without any issues. The variations I tried were just setting --trim_mode to skip, adaptive, and force, in case that had some relevant impact on the channel where issues had occurred. I accidentally left in the line export NXF_DEFAULT_DSL=1, but this shouldn't have had any impact on how nextflow ran in these tests. Very weird. I don't see any meaningful difference between our wrapper scripts, either.

Nick-Eagles commented 9 months ago

Hey Geo, the most recent SPEAQeasy uses DSL2, and because the channel management changed substantially, I wouldn't be surprised if this issue is resolved now. I also made an effort to maintain the pairing of filenames with sample IDs, rather than re-deriving the sampleID from filenames in each process: this also reduces the risk of unexpected mismatches.

LieberInstitute / SPEAQeasy

[JHPCE] repeated fastqc failure on trimmed paired data #100