Errors (reading file and I/O) for one of 9 pooled assemblies

tonyaseverson commented 7 months ago

I ran RNA-Bloom in pool mode without a reference. Each replicate from a species/condition was included in a pool. Nine pools were submitted as separate array jobs. All nine pooled assemblies ran to completion (apparently), but when I checked the error logs, one of the nine contained a lot of ominous messages starting with "rnabloom.io.FileFormatException: Error reading file" and I/O errors and exceptions. I'm wondering if this is just noise or something pernicious.

RNA-Bloom version (java -jar RNA-Bloom.jar -version): 2.0.1
Java version (java -version): 17.0.6
Exact command used to run RNA-Bloom:
java -Xmx10g -jar ~/bin/thirdparty/RNA-Bloom_v2.0.1/RNA-Bloom.jar -stranded -ntcard -fpr 0.005 -k 25-75:5 -extend -t 32 -outdir $outdir -rcr -pool "$sample_list"

I've attached my sample list: 20231109_rnabloom_pool.pdf

I've also attached log files containing output and error messages captured by SLURM for the affected samples. The out.pdf contains the RNA-Bloom progress report output after a bunch of details that I routinely log. The err.pdf file contains the error messages that have me concerned: 12875770_3_rnabloom.out.pdf 12875770_3_rnabloom.err.pdf

Thank you!

kmnip commented 7 months ago

Hi @tonyaseverson ,

~~Can you please post the content of your pool list file at /home/tfsevers/manifests/20231109_rnabloom_pool.txt?~~ ~~I want to see whether this file was configured correctly.~~ EDIT: Sorry, I missed the list file and it looks fine to me!

The errors in the error log appeared to originate from the 2nd stage of the assembly.

This assembly was definitely incomplete because only one pair of FASTQ files was parsed in the 2nd stage. I suspect that your paired input FASTQs do not have the same number of reads. Can you please confirm this? e.g.

cd /home/tfsevers/scratch/data/transcriptomes/resynthesized_bnapus/bbsplit_cleaned/10734406/
zcat bn3501_1.cleaned.cleaned.fq.gz | wc -l
zcat bn3501_2.cleaned.cleaned.fq.gz | wc -l

If the returned numbers do not match, then you need to fix your FASTQ files. Based on the file name suffix (.cleaned.cleaned.fq.gz), I am guessing that you used an adaptor/QC-trimming tool? If so, then you need to configure your read-trimming tool to write unpaired reads to a separate output from the paired reads.
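Note that wc -l reports lines rather than reads; a FASTQ record normally spans four lines, so the read count is the line count divided by four. A minimal sketch of the comparison, assuming every record is exactly four lines (no wrapped sequence lines). The function names count_reads and check_pair are illustrative and not part of RNA-Bloom:

```shell
# Count reads in a gzip-compressed FASTQ file.
# Assumption: each record is exactly 4 lines.
count_reads() {
    local lines
    lines=$(gzip -dc "$1" | wc -l)
    echo $(( lines / 4 ))
}

# Print the read count of both mates and succeed only if they agree.
check_pair() {
    local n1 n2
    n1=$(count_reads "$1")
    n2=$(count_reads "$2")
    echo "$1: $n1 reads"
    echo "$2: $n2 reads"
    [ "$n1" -eq "$n2" ]
}
```

For example, check_pair bn3501_1.cleaned.cleaned.fq.gz bn3501_2.cleaned.cleaned.fq.gz || echo "read counts differ" would flag a pair whose mates contain different numbers of reads.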

tonyaseverson commented 7 months ago

I trimmed and filtered with BBDuk and filtered contaminants with BBSplit and it appears that at one step I allowed singletons to slip through. If that is the issue, then it is strange that none of the other pools' logs contained error messages.

I'll take a look at my previous steps and see where the singletons crept in.

Thanks!

tonyaseverson commented 7 months ago

I checked, and each pair of FASTQ files contains the same number of reads: FastQC reported the same number of reads for each pair, and wc -l reported the same number of lines.

bn3501_1.cleaned.cleaned.fq.gz: 565804728
bn3501_2.cleaned.cleaned.fq.gz: 565804728
bn3502_1.cleaned.cleaned.fq.gz: 526033816
bn3502_2.cleaned.cleaned.fq.gz: 526033816
bn3503_1.cleaned.cleaned.fq.gz: 588690464
bn3503_2.cleaned.cleaned.fq.gz: 588690464

I didn't lose a lot of reads to filtering, so just ran RNA-Bloom on the complete pairs.

tonyaseverson commented 7 months ago

I checked the headers in read1 and read2 files, and the ordering also seems correct for all of the fastq files in the failed pool. So the root cause of the failed pool assembly doesn't appear to be mixing of paired and single reads.
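One way such a header check could be scripted is sketched below. It assumes the read ID is the first whitespace-delimited token of each header line and that any trailing /1 or /2 mate suffix should be ignored; read_ids and same_read_order are hypothetical helper names, not tools from BBTools or RNA-Bloom:

```shell
# Print one read ID per record from a gzip-compressed FASTQ file.
# Assumptions: 4 lines per record; the ID is the first token of the header;
# an optional trailing /1 or /2 mate suffix is stripped.
read_ids() {
    gzip -dc "$1" | awk 'NR % 4 == 1 { id = $1; sub(/\/[12]$/, "", id); print id }'
}

# Succeed only if both files list the same read IDs in the same order.
same_read_order() {
    ids1=$(mktemp)
    ids2=$(mktemp)
    read_ids "$1" > "$ids1"
    read_ids "$2" > "$ids2"
    if cmp -s "$ids1" "$ids2"; then status=0; else status=1; fi
    rm -f "$ids1" "$ids2"
    return "$status"
}
```

For files of this size the two ID lists are large, so the temporary directory needs enough free space to hold them.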

kmnip commented 7 months ago

Thanks for confirming the number of reads.

Can you split the pool list file into three, one each for bn3501, bn3502, and bn3503, and generate an assembly for each of the sub-pools?

If there are no issues with these assemblies, then please try re-running the entire pooled assembly from scratch.

tonyaseverson commented 7 months ago

If I understand correctly, I should run reads from the bn3501, bn3502, and bn3503 libraries separately, like I would run bulk RNA?

My array job splits the list of 27 libraries into 9 subpool lists containing 3 libraries each, and each list was processed in a separate SLURM array task that invoked RNA-Bloom. Libraries bn3501, bn3502, and bn3503 were in one of the subpools. My experiment involves 3 species (and 3 different temperature treatments), so each subpool contains replicates from a single species/temperature combination. Perhaps a better term for the list of 27 would be metapool, with the names indicating the actual pools that were assembled separately.

kmnip commented 7 months ago

I wanted to see whether you would encounter any issues if you used only a subset of your read files. You could set up the following two assemblies manually, without an array job:

Assembly 1:

bn3501 /path/to/bn3501_2.cleaned.cleaned.fq.gz /path/to/bn3501_1.cleaned.cleaned.fq.gz

Assembly 2:

bn3502 /path/to/bn3502_2.cleaned.cleaned.fq.gz /path/to/bn3502_1.cleaned.cleaned.fq.gz
bn3503 /path/to/bn3503_2.cleaned.cleaned.fq.gz /path/to/bn3503_1.cleaned.cleaned.fq.gz

Also, I recommend not specifying the error log path in your batch script (e.g. #SBATCH -e log.err). Let stderr go directly into your output log, so that you can see exactly when the error happens, all within one file.
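A minimal sketch of such a batch script follows. When no error path is given, SLURM writes stderr to the same file as stdout. The log file name and the run_logged wrapper are assumptions added for illustration; the wrapper also folds stderr into stdout explicitly and prints timestamps, so an error can be placed relative to the progress messages:

```shell
#!/bin/bash
#SBATCH --output=rnabloom_%A_%a.log   # %A = array job ID, %a = array task index
# No "#SBATCH --error" line, so stderr lands in the same log as stdout.

# Run a command with stderr merged into stdout, with a timestamp before
# and after, and pass the command's exit status back to the caller.
run_logged() {
    echo "[$(date '+%Y-%m-%d %H:%M:%S')] start: $*"
    "$@" 2>&1 && status=0 || status=$?
    echo "[$(date '+%Y-%m-%d %H:%M:%S')] exit status: $status"
    return "$status"
}

# Example use inside the batch script (command taken from the original report):
# run_logged java -Xmx10g -jar ~/bin/thirdparty/RNA-Bloom_v2.0.1/RNA-Bloom.jar \
#     -stranded -ntcard -fpr 0.005 -k 25-75:5 -extend -t 32 \
#     -outdir "$outdir" -rcr -pool "$sample_list"
```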

tonyaseverson commented 7 months ago

@kmnip - I ran each of the bn35 sequences individually, and in each possible combination of two, and I did not have any errors in my .out files. I also reran all three in pooled mode (name = bn35), and rnabloom ran to completion with no errors.

I launched the full pooled assembly, but this time a different pool threw errors, although it involved the same combination of sequences as one of the 8 pools that were successfully assembled in parallel with the original pool that failed. So it doesn't seem to be a problem with file integrity. Could it be due to some kind of race condition, thread management problem, or hardware issue?

I had a different slurm job that terminated the same day, with seff output indicating it was due to a node failure.

kmnip commented 7 months ago

That's very strange. I guess that rules out file formatting... And, I cannot replicate it on my end using my own test data mimicking your list file.

I do not think this has anything to do with a race condition or thread management, because the file-reading code is synchronized in Java. Also, my colleagues and I have assembled much larger datasets with more CPUs. If it were a multithreading issue, you could simply lower the -t option, but I don't think that is the issue here.

I recommend checking whether the error is observed on the same HPC node. That should tell you whether it is a hardware issue.
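A sketch of how that comparison might be done with SLURM accounting data, assuming sacct is available on the cluster. The job ID below is inferred from the attached log file names and is only a placeholder; tally_nodes is a hypothetical helper that counts how many of the listed tasks ran on each node:

```shell
# Read sacct rows of the form "JobID|State|ExitCode|NodeList" on stdin
# and print each node with the number of listed tasks that ran on it.
tally_nodes() {
    awk -F'|' '{ n[$4]++ } END { for (k in n) print k, n[k] }' | sort
}

# Example (placeholder job ID; list the array tasks whose logs contained errors):
# sacct -X -n -P -j 12875770_3 --format=JobID,State,ExitCode,NodeList | tally_nodes
```

If every affected task maps to the same node, that points toward the node; if the tasks are spread across nodes, a shared resource such as the filesystem becomes a more likely suspect.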

On a side note, I noticed this message in your stderr log:

Picked up JAVA_TOOL_OPTIONS: -Xmx2g

I suggest increasing that to a larger value, e.g. -Xmx20g. However, I don't think this is the cause of the issue; otherwise, you would have run into an out-of-memory error instead of what you had in the stderr log.

tonyaseverson commented 7 months ago

I have a hunch that my filesystem was getting bogged down. I'm a data hoarder and ls -lah started hanging; I got to thinking that the symptoms were sort of like my issue with rnabloom. I removed a bunch of files that I no longer need and reran rnabloom to assemble transcripts for pooled replicates of each species-condition combination and for each species. Both ran to completion with no errors. Sorry for the distraction.

kmnip commented 7 months ago

Glad to hear that you figured it out. Thanks for using RNA-Bloom!

bcgsc / RNA-Bloom

Errors (reading file and I/O) for one of 9 pooled assemblies #63