UPHL-BioNGS / Cecret

Reference-based consensus creation
MIT License
50 stars 27 forks source link

BWA error for paired fastqs in Cecret #224

Closed DrB-S closed 1 year ago

DrB-S commented 1 year ago

I am running the latest version of Cecret on paired-end fastqs as follows: nextflow run UPHL-BioNGS/Cecret -profile singularity -c configs/sarscov.config --medcpus 70 --maxcpus 100 --reads reads --relatedness true --msa nextalign

The reads files all end with _R(1,2}.fastq A typical paired name is: 2023CV1635.BC4L11.2023-06-29.01.C03_R1.fastq and 2023CV1635.BC4L11.2023-06-29.01.C03_R2.fastq

They were derived from an interleaved fastq file (2023CV1635.BC4L11.2023-06-29.01.C03.fastq.gz) after unzipping using the following:

fastaq deinterleave 2023CV1635.BC4L11.2023-06-29.01.C03.fastq 2023CV1635.BC4L11.2023-06-29.01.C03_R1.fastq 2023CV1635.BC4L11.2023-06-29.01.C03_R2.fastq

Here is the error: ERROR ~ Error executing process > 'CECRET:cecret:bwa (2023CV1635.BC4L11.2023-06-29.01.C03)'

Caused by: Process CECRET:cecret:bwa (2023CV1635.BC4L11.2023-06-29.01.C03) terminated with an error exit status (1)

Command executed:

mkdir -p bwa logs/CECRET:cecret:bwa log=logs/CECRET:cecret:bwa/2023CV1635.BC4L11.2023-06-29.01.C03.2e441589-1655-44d3-8de3-d34884116268.log

time stamp + capturing tool versions

date > $log echo "bwa $(bwa 2>&1 | grep Version )" >> $log bwa_version="bwa : "$(bwa 2>&1 | grep Version)

index the reference fasta file

bwa index MN908947.3.fasta

bwa mem command

bwa mem -t 100 MN908947.3.fasta 2023CV1635.BC4L11.2023-06-29.01.C03_clean_PE1.fastq.gz 2023CV1635.BC4L11.2023-06-29.01.C03_clean_PE2.fastq.gz > bwa/2023CV1635.BC4L11.2023-06-29.01.C03.sam

Command exit status: 1

Command output: (empty)

Command error: [bwa_index] Pack FASTA... 0.00 sec [bwa_index] Construct BWT for the packed sequence... [bwa_index] 0.01 seconds elapse. [bwa_index] Update BWT... 0.00 sec [bwa_index] Pack forward-only FASTA... 0.00 sec [bwa_index] Construct SA from BWT and Occ... 0.00 sec [main] Version: 0.7.17-r1188 [main] CMD: bwa index MN908947.3.fasta [main] Real time: 0.028 sec; CPU: 0.017 sec [M::bwa_idx_load_from_disk] read 0 ALT contigs [M::process] read 3830 sequences (550465 bp)... [M::mem_pestat] # candidate unique pairs for (FF, FR, RF, RR): (505, 0, 0, 501) [M::mem_pestat] analyzing insert size distribution for orientation FF... [M::mem_pestat] (25, 50, 75) percentile: (709, 3832, 7080) [M::mem_pestat] low and high boundaries for computing mean and std.dev: (1, 19822) [M::mem_pestat] mean and std.dev: (4089.34, 3196.52) [M::mem_pestat] low and high boundaries for proper pairs: (1, 26193) [M::mem_pestat] skip orientation FR as there are not enough pairs [M::mem_pestat] skip orientation RF as there are not enough pairs [M::mem_pestat] analyzing insert size distribution for orientation RR... [M::mem_pestat] (25, 50, 75) percentile: (1059, 3892, 7015) [M::mem_pestat] low and high boundaries for computing mean and std.dev: (1, 18927) [M::mem_pestat] mean and std.dev: (4197.31, 3100.22) [M::mem_pestat] low and high boundaries for proper pairs: (1, 24883) [mem_sam_pe] paired reads have different names: "00056560-4cc8-4869-a2f5-855b469d586d", "001c0b61-b674-4486-81fa-7ffac2957da1"

Work dir: /data/Sequence_analysis/Cecret/Analyses/Covid_Terra_6Sep23/work/33/d5997c0f46ce6f7cdd83b12ec362af

=== NOTE: The same error occurred when I did not deinterleave the files first!

erinyoung commented 1 year ago

Oh no!

This is from https://github.com/lh3/bwa/issues/228#issuecomment-534769169 , which may apply here

If bwa is terminated with this error, it usually signifies that pair reads in both FASTQ files are disordered.

From BBTools Repair Guide:
With paired reads in 2 files, the first read in file 1 must be the mate of the first read in file 2, etc. For paired reads in a single interleaved file, the second read is the mate of the first read, and the 4th read is the mate of the 3rd read, etc.

There are few solutions which worked for me.

Sort FASTQ files
BBTools repair.sh is supposed to do that, but it seems it cannot fix large datasets as it loads whole file to RAM.
FASTQ-SORT is way better for large FASTQs. It buffers RAM on hard drive, it successfully sorted ~ 150GB file.
Use another aligner
Bowtie 2 doesn't require the condition of mates in the same order, but while aligning disordered FASTQs with Bowtie 2 (as of my experience) you get less mapped reads.

So you can try to repair your fastq files with bbtools' repair.sh or trying use minimap for alignment (set params.aligner = 'minimap2')

erinyoung commented 1 year ago

You could also try just keeping the files interleaved and treating them as single reads. It's not "as good", but should "still work".

DrB-S commented 1 year ago

Thanks! I'll try minimap2 first on the non-interleaved files.

DrB-S commented 1 year ago

It is running nicely with minimap2!

erinyoung commented 1 year ago

VICTORY!!!

DrB-S commented 1 year ago

I ran twice using minimap2:

  1. With the complete paired-end sequences, which died at CECRET:qc:aci
  2. With downsampled (to 30x) paired-end sequences, which died at CECRET:sarscov2:vadr (QC metrics).
DrB-S commented 1 year ago

Error Output - Full Fastq Seqs.txt

Error Output - Downsampled Fastqs.txt

erinyoung commented 1 year ago

I have a lot of questions!

The vadr error is suggestive that your fasta files may be empty.

It's weird that ACI worked the second time, so I wonder if it's a memory issue.

I generally like to use

nextflow run UPHL-BIoNGS/Cecret --reads <directory to reads>

Nextflow sometimes has issues with complicated filenames with multiple numbers in them, so I wonder if your input files are getting paired correctly. If you look in the output/log files for fastqc and seqyclean, does it look like it should? If not, you may need to use a sample sheet to get your reads into the workflow.

DrB-S commented 1 year ago

Here is the command-line for both:
nextflow run UPHL-BioNGS/Cecret -profile singularity -c configs/sarscov.config --medcpus 70 --maxcpus 100 --reads reads --relatedness true --msa nextalign --aligner minimp2

erinyoung commented 1 year ago

What do your fastqc or seqyclean output files/log files look like?

DrB-S commented 1 year ago

I see the problem. There are 2 consensus files for each sample, rather than assembling the fastqs as read-pairs. Here is an example: -rw-rw-r-- 1 becksts becksts 30111 Sep 6 12:38 2023CV1724.BC4L11.2023-07-13.01.D01.consensus.fa -rw-rw-r-- 1 becksts becksts 29685 Sep 6 10:47 2023CV1724.BC4L11.2023-07-13.01.D01_R.consensus.fa

There is a problem with the fastq file names after deinterleaving: 2023CV1724.BC4L11.2023-07-13.01.D01_R1.fastq 2023CV1724.BC4L11.2023-07-13.01.D01_R2.fastq

The same problem happens when using the .fq ending.

DrB-S commented 1 year ago

My suspicion is that they may not actually be read-pairs at all, and that I will need to use the original fastqs....

erinyoung commented 1 year ago

It might be the file names.

In my mind 2023CV1724.BC4L11.2023-07-13.01.D01_R1.fastq would obviously get paired with 2023CV1724.BC4L11.2023-07-13.01.D01_R2.fastq, but nextflow might pair it was 2023CV1714.BC4L11.2023-07-13.01.D01_R1.fastq.

If you can't find the original fastq files, I recommend using a sample sheet. https://github.com/UPHL-BioNGS/Cecret#using-a-sample-sheet (and hopefully that resolves the issue)

DrB-S commented 1 year ago

I have the originals. I'll run those first.

DrB-S commented 1 year ago

I ran the original fastqs using the originals with the following command: nextflow run UPHL-BioNGS/Cecret -profile singularity -c configs/sarscov.config --medcpus 70 --maxcpus 100 --single_reads single_reads --relatedness true --msa nextalign --aligner minimap2

It ran all the way through to the summary and died. The consensus files are predominantly N's, so they don't appear to be matching with MN908947. I used 'single_reads' because the deinterleaved fastqs did not work (was this a problem?). A snp matrix is produced, which shows that they are all very similar to each other and to MN908947....

I have attached amplicon_depth.csv from the aci directory: amplicon_depth.csv.zip

erinyoung commented 1 year ago

It died on the summary!!!?!?!?!

It shouldn't be a problem to run them as single-reads. It's just not as good.

Can you share a fastq file with me?

There's gotta be something I'm not looking for.

DrB-S commented 1 year ago

When I use the --reads flag, cecret quits, immediately after starting. But when I use --single_reads, it runs. I have it running one more time....

Attached is the smallest fastq.gz (1.8M). The largest is 141M. 2023CV2184.BC4L11.2023-08-22.01.H02.fastq.gz

erinyoung commented 1 year ago

This isn't going to copy well, but this is a copy of the *hist file from the samtools coverage process. This file doesn't have enough reads. Right now the default mean depth for ivar to call variants is 100X, and this was a mean depth of less than 10 across the entire genome.

2023CV2184.cov.hist.txt

MN908947.3 (29.9Kbp)
>  89.70% │    ▁             ▆    █                   ▅      │ Number of reads: 313
>  79.73% │    █         █   █    █   ▂               █   ▆  │
>  69.77% │▁   ██        █   █  ▁ █   █   ▇           █   █  │ Covered bases:   12.4Kbp
>  59.80% │█   ██        █▅ ▃█  █▂█   █   █▂▅         █▄ ▅█  │ Percent covered: 41.37%
>  49.83% │█   ██        ██ ██  ███   █▅  ███  ▁      ██ ██  │ Mean coverage:   0.963x
>  39.87% │█   ██     ▂  ██ ██  ███   ██  ███  █      ██ ██  │ Mean baseQ:      31.2
>  29.90% │█   ██ ▄   █▅███ ██▄ ███▄  ██  ███  █      ██ ██▇ │ Mean mapQ:       60
>  19.93% │█   ██ █▂  █████▁███▇████▇████▇███  █ ▇    ██ ███ │
>   9.97% │█  ▁██▃██ ▄███████████████████████▇▁█ █  ▅ ██ ███ │ Histo bin width: 598bp
>   0.00% │█  ██████ ███████████████████████████▂█  █ ██ ███ │ Histo max bin:   99.666%
          1        6.0K     12.0K     17.9K     23.9K      29.9K

I think there's less than 3500 reads in that fastq file.

Good news, though, most of the reads were SARS-CoV-2. Here's the Kraken2 report:

$ cat cecret/kraken2/2023CV2184_kraken2_report.txt
 10.32  331     331     U       0       unclassified
 89.68  2877    0       R       1       root
 89.68  2877    0       D       10239     Viruses
 89.65  2876    0       D1      2559587     Riboviria
 89.65  2876    0       O       76804         Nidovirales
 89.65  2876    0       O1      2499399         Cornidovirineae
 89.65  2876    1       F       11118             Coronaviridae
 89.62  2875    0       F1      2501931             Orthocoronavirinae
 89.62  2875    0       G       694002                Betacoronavirus
 89.62  2875    0       G1      2509511                 Sarbecovirus
 89.62  2875    4       S       694009                    Severe acute respiratory syndrome-related coronavirus
 89.50  2871    2871    S1      2697049                     Severe acute respiratory syndrome coronavirus 2
  0.03  1       0       O       28883       Caudovirales
  0.03  1       0       F       10662         Myoviridae
  0.03  1       1       G       186777          Muvirus

What's the read count in your other files?

erinyoung commented 1 year ago

Actually, I just looked at the reads. These are really long. Are you sure they are paired-end Illumina reads?

fastqc_per_base_sequence_quality_plot

erinyoung commented 1 year ago

When I treated these files as nanopore instead of Illumina, I was able to get more bases called. (Nanopore uses artic instead of ivar)

2023CV2184.consensus.fa.txt

Are these from a new fangled Illumina long-read sequencer?

DrB-S commented 1 year ago

I'll check. Meanwhile I have removed the gzipped fastqs < 10M from the comparisons. I will use --nanopore for the reads.

DrB-S commented 1 year ago

Most are ClearLabs with CL Modified Midnight Primers V3, 4 are MiSeq with Artic v5.3, and 4 are NextSeq with Artic v4.1.

erinyoung commented 1 year ago

That sounds like an awesome dataset!

Sorry to repeat myself from the conversation on SLACK, but someone else may have this same issue and I'd like for them to know how to get it resolved.

Nanopore and Illumina files can be read into the same run, but multiple primer bedfiles are not. For the best results, you'll want to run Cecret on eat set of samples divided by primer set.

The midnight primer set isn't included in this workflow as of today, but now that we've added nanopore functionality they are likely to be more common. I'll put adding the midnight primer set to my todo list for a future release.

DrB-S commented 1 year ago

What primer set is best to use with the ClearLabs sequences for the time-being?

DrB-S commented 1 year ago

I used artic_V3, with the following command-line, but it failed: nextflow run UPHL-BioNGS/Cecret -profile singularity,artic_V3 -c configs/sarscov.config --medcpus 70 --maxcpus 100 --nanopore nanopore

DrB-S commented 1 year ago

This is the failed command: Command failed:medaka consensus --model r941_min_high_g360 --threads 100 --chunk_len 800 --chunk_ovlp 400 --RG nCoV-2019_1 artic/2023CV2249.trimmed.rg.sorted.bam artic/2023CV2249.nCoV-2019_1.hdf

DrB-S commented 1 year ago

I found this web page: https://labs.epi2me.io/ont-midnight-scheme-update

excerpt: The rule of thumb for Midnight kits bought from ONT is that the kit version will always pair with the scheme version e.g. EXP-MRT001.30 should be paired with primer scheme Midnight-ONT/V3.

A table follows showing the additions for V3.

DrB-S commented 1 year ago

Here is a link for what appears to be the new 6-column bed file:

https://github.com/artic-network/primer-schemes/blob/master/nCoV-2019/V3/nCoV-2019.primer.bed, which you already have in schema

erinyoung commented 1 year ago

This is the failed command: Command failed:medaka consensus --model r941_min_high_g360 --threads 100 --chunk_len 800 --chunk_ovlp 400 --RG nCoV-2019_1 artic/2023CV2249.trimmed.rg.sorted.bam artic/2023CV2249.nCoV-2019_1.hdf

My apologies, but I need more information about this error.

What inputs did you use with the workflow? Could you share your configs/sarscov.config file? Could you share the full error? (It's possible that the container is out-of-date and more models need to be added to it.)

Also, the link you shared is for the artic V3 primers, which are different than the midnight V3 primers. I am unsure of the full implications and problems of using the "wrong" primer set, so I recommend using the primers that were used to generate the reads.

DrB-S commented 1 year ago

I reran Cecret on just the ONT reads and didn't specify the primer set: nextflow run UPHL-BioNGS/Cecret -profile singularity -c configs/sarscov.config --medcpus 70 --maxcpus 100 --nanopore nanopore

I have attached nextflow.log and sarscov.config I can't seem to find the ONT Midnight Primers V3 bed file anywhere. Let me know if you need anything else.

nextflow.log sarscov.config.txt

DrB-S commented 1 year ago

Attached is an ONT modified Midnight Primers V3 bed that I generated from the scheme file. I will try to use it in my command-line. CL_Modified_Midnight_Primers_SARS-CoV-2.primer.bed.zip

DrB-S commented 1 year ago

I incorporated the new bedfiles as follows in my CLI but Cecret errored out (log attached): nextflow run UPHL-BioNGS/Cecret -profile singularity,sarscov2 -c configs/sarscov.config --medcpus 70 --maxcpus 100 --nanopore nanopore --primer_bed configs/CL_Modified_Midnight_Primers_SARS-CoV-2.primer.bed --insert_bed configs/CL_Modified_Midnight_Primers_SARS-CoV-2.insert.bed

nextflow.log

erinyoung commented 1 year ago

I think I may have a solution, but I'm also going to check the biocontainer's artic version (1.2.3--pyhdfd78af_0) to see if that one would be best to use (artic is currently on version 1.2.4, but 1.2.3 is likely fine)

DrB-S commented 1 year ago

The nextflow.log shows the entire error. I have attached sarscov.config sarscov.config.txt

erinyoung commented 1 year ago

Can you test the new version? I'm hopeful that this container works better in the cloud.

DrB-S commented 1 year ago

Sure. I'll grab it now. Thanks!

DrB-S commented 1 year ago

nextflow pull UPHL-BioNGS/Cecret Checking UPHL-BioNGS/Cecret ... UPHL-BioNGS/Cecret contains uncommitted changes -- cannot pull from repository

DrB-S commented 1 year ago

It didn’t work because I am out of space on the drive.

Stephen M. Beckstrom-Sternberg, PhD Bioinformatics Contractor

Arizona State Public Health Lab Arizona Department of Health Services Cell: (602) 653-5011 Email: @.***

On Sep 19, 2023, at 1:42 PM, Young @.***> wrote:

Can you test the new version? I'm hopeful that this container works better in the cloud.

— Reply to this email directly, view it on GitHub https://github.com/UPHL-BioNGS/Cecret/issues/224#issuecomment-1726439402, or unsubscribe https://github.com/notifications/unsubscribe-auth/AVTVLJQRXZPP2VCITRXO4QDX3H7ULANCNFSM6AAAAAA4NUJTQM. You are receiving this because you authored the thread.

-- CONFIDENTIALITY NOTICE:  This e-mail is the property of the Arizona Department of Health Services and contains information that may be PRIVILEGED, CONFIDENTIAL, or otherwise exempt from disclosure by applicable law.  It is intended only for the person(s) to whom it is addressed.  If you have received this communication in error, please do not retain or distribute it.  Please notify the sender immediately by e-mail at the address shown above and delete the original message.  Thank you.  

erinyoung commented 1 year ago

nextflow pull UPHL-BioNGS/Cecret Checking UPHL-BioNGS/Cecret ... UPHL-BioNGS/Cecret contains uncommitted changes -- cannot pull from repository

If nextflow pull doesn't work, you probably need to remove the directory and re-pull it.

rm -rf ~/.nextflow/assets/UPHL-BioNGS/Cecret
erinyoung commented 1 year ago

As for the space, you might need to remove the 'work' directory and empty your '/tmp' directory. Sometimes tmp files aren't deleted when a nextflow workflow fails.

DrB-S commented 1 year ago

Still running out of space. We are looking into directories to delete….

Stephen M. Beckstrom-Sternberg, PhD Bioinformatics Contractor

Arizona State Public Health Lab Arizona Department of Health Services Cell: (602) 653-5011 Email: @.***

On Sep 19, 2023, at 3:58 PM, Young @.***> wrote:

As for the space, you might need to remove the 'work' directory and empty your '/tmp' directory. Sometimes tmp files aren't deleted when a nextflow workflow fails.

— Reply to this email directly, view it on GitHub https://github.com/UPHL-BioNGS/Cecret/issues/224#issuecomment-1726651338, or unsubscribe https://github.com/notifications/unsubscribe-auth/AVTVLJTPDTEPJZ3CWHUYARDX3IPR7ANCNFSM6AAAAAA4NUJTQM. You are receiving this because you authored the thread.

-- CONFIDENTIALITY NOTICE:  This e-mail is the property of the Arizona Department of Health Services and contains information that may be PRIVILEGED, CONFIDENTIAL, or otherwise exempt from disclosure by applicable law.  It is intended only for the person(s) to whom it is addressed.  If you have received this communication in error, please do not retain or distribute it.  Please notify the sender immediately by e-mail at the address shown above and delete the original message.  Thank you.  

erinyoung commented 1 year ago

I have a thought. What version of singularity are you using?

DrB-S commented 1 year ago

singularity version 3.8.7-1.el7

DrB-S commented 1 year ago

The nanopore reads ran fine with the new version of Cecret.

erinyoung commented 1 year ago

Oh good! I've been so worried!!!!