Closed: DrB-S closed this issue 1 year ago
Oh no!
This is from https://github.com/lh3/bwa/issues/228#issuecomment-534769169 , which may apply here
If bwa terminates with this error, it usually means the paired reads in the two FASTQ files are out of order.
From BBTools Repair Guide:
With paired reads in 2 files, the first read in file 1 must be the mate of the first read in file 2, etc. For paired reads in a single interleaved file, the second read is the mate of the first read, and the 4th read is the mate of the 3rd read, etc.
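That ordering requirement can be checked mechanically. Here is a minimal sketch (hypothetical helper names, not part of any tool mentioned here) that walks two FASTQ files in lockstep and reports the first out-of-sync record:

```python
import gzip
from itertools import zip_longest

def canonical(read_id):
    """Strip a trailing /1 or /2 mate marker, if present."""
    return read_id[:-2] if read_id.endswith(("/1", "/2")) else read_id

def read_ids(path):
    """Yield the canonical ID of each record (FASTQ records are 4 lines)."""
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "rt") as fh:
        for i, line in enumerate(fh):
            if i % 4 == 0:  # line 0 of each record is the @header
                yield canonical(line.split()[0].lstrip("@"))

def first_mismatch(r1_path, r2_path):
    """Return (record_index, id1, id2) for the first out-of-sync pair, or None."""
    for n, (a, b) in enumerate(zip_longest(read_ids(r1_path), read_ids(r2_path))):
        if a != b:
            return (n, a, b)
    return None
```

If this returns anything other than None, bwa mem will eventually hit the same mismatch and abort.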
There are a few solutions that worked for me.
Sort FASTQ files
BBTools repair.sh is supposed to do that, but it seems it cannot fix large datasets because it loads the whole file into RAM.
fastq-sort works much better for large FASTQs: it buffers to disk instead of RAM, and it successfully sorted a ~150 GB file for me.
Use another aligner
Bowtie 2 doesn't require mates to be in the same order, but in my experience aligning disordered FASTQs with Bowtie 2 yields fewer mapped reads.
So you can try to repair your FASTQ files with BBTools' repair.sh, or try using minimap2 for alignment (set params.aligner = 'minimap2').
You could also try just keeping the files interleaved and treating them as single reads. It's not "as good", but should "still work".
Thanks! I'll try minimap2 first on the non-interleaved files.
It is running nicely with minimap2!
VICTORY!!!
I ran twice using minimap2:
I have a lot of questions!
The vadr error suggests that your fasta files in cecret/consensus/* may be empty.
It's weird that ACI worked the second time, so I wonder if it's a memory issue.
I generally like to use
nextflow run UPHL-BioNGS/Cecret --reads <directory of reads>
Nextflow sometimes has issues with complicated filenames with multiple numbers in them, so I wonder if your input files are getting paired correctly. If you look in the output/log files for fastqc and seqyclean, does it look like it should? If not, you may need to use a sample sheet to get your reads into the workflow.
Here is the command-line for both:
nextflow run UPHL-BioNGS/Cecret -profile singularity -c configs/sarscov.config --medcpus 70 --maxcpus 100 --reads reads --relatedness true --msa nextalign --aligner minimap2
What do your fastqc or seqyclean output files/log files look like?
I see the problem. There are 2 consensus files for each sample, rather than assembling the fastqs as read-pairs. Here is an example:
-rw-rw-r-- 1 becksts becksts 30111 Sep 6 12:38 2023CV1724.BC4L11.2023-07-13.01.D01.consensus.fa
-rw-rw-r-- 1 becksts becksts 29685 Sep 6 10:47 2023CV1724.BC4L11.2023-07-13.01.D01_R.consensus.fa
There is a problem with the fastq file names after deinterleaving:
2023CV1724.BC4L11.2023-07-13.01.D01_R1.fastq
2023CV1724.BC4L11.2023-07-13.01.D01_R2.fastq
The same problem happens when using the .fq ending.
My suspicion is that they may not actually be read-pairs at all, and that I will need to use the original fastqs....
It might be the file names. In my mind, 2023CV1724.BC4L11.2023-07-13.01.D01_R1.fastq would obviously get paired with 2023CV1724.BC4L11.2023-07-13.01.D01_R2.fastq, but nextflow might pair it with 2023CV1714.BC4L11.2023-07-13.01.D01_R1.fastq.
If you can't find the original fastq files, I recommend using a sample sheet. https://github.com/UPHL-BioNGS/Cecret#using-a-sample-sheet (and hopefully that resolves the issue)
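A sample sheet sidesteps nextflow's filename-pairing heuristics entirely. The sketch below builds one by pairing files on the prefix before _R1/_R2; the sample,fastq_1,fastq_2 header is an assumption borrowed from common nf-core-style sheets, so check the Cecret README for the exact column names before using it:

```python
import csv
import re
from pathlib import Path

def build_sample_sheet(reads_dir, out_csv):
    """Pair *_R1*/*_R2* fastqs by shared prefix and write a CSV sample sheet.
    The column names are an assumption -- verify against the Cecret docs."""
    pairs = {}
    for fq in sorted(Path(reads_dir).glob("*.fastq*")):
        m = re.match(r"(.+)_R([12])(?=\.f)", fq.name)
        if m:
            pairs.setdefault(m.group(1), {})[m.group(2)] = str(fq)
    with open(out_csv, "w", newline="") as fh:
        w = csv.writer(fh)
        w.writerow(["sample", "fastq_1", "fastq_2"])
        for sample, mates in pairs.items():
            # only emit samples that actually have both mates
            if "1" in mates and "2" in mates:
                w.writerow([sample, mates["1"], mates["2"]])
    return out_csv
```

Because the full filename prefix becomes the sample ID, names that differ by a single digit (like 2023CV1714 vs 2023CV1724) can no longer be cross-paired.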
I have the originals. I'll run those first.
I ran the original fastqs with the following command:
nextflow run UPHL-BioNGS/Cecret -profile singularity -c configs/sarscov.config --medcpus 70 --maxcpus 100 --single_reads single_reads --relatedness true --msa nextalign --aligner minimap2
It ran all the way through to the summary and then died. The consensus files are predominantly N's, so the reads don't appear to be mapping to MN908947. I used --single_reads because the deinterleaved fastqs did not work (was this a problem?). A SNP matrix is produced, which shows that they are all very similar to each other and to MN908947....
I have attached amplicon_depth.csv from the aci directory: amplicon_depth.csv.zip
It died on the summary!!!?!?!?!
It shouldn't be a problem to run them as single-reads. It's just not as good.
Can you share a fastq file with me?
There's gotta be something I'm not looking for.
When I use the --reads flag, cecret quits, immediately after starting. But when I use --single_reads, it runs. I have it running one more time....
Attached is the smallest fastq.gz (1.8M). The largest is 141M. 2023CV2184.BC4L11.2023-08-22.01.H02.fastq.gz
This isn't going to copy well, but here is the *hist file from the samtools coverage process. This file doesn't have enough reads. Right now the default mean depth for ivar to call variants is 100X, and this was a mean depth of less than 10 across the entire genome.
MN908947.3 (29.9Kbp)
> 89.70% │ ▁ ▆ █ ▅ │ Number of reads: 313
> 79.73% │ █ █ █ █ ▂ █ ▆ │
> 69.77% │▁ ██ █ █ ▁ █ █ ▇ █ █ │ Covered bases: 12.4Kbp
> 59.80% │█ ██ █▅ ▃█ █▂█ █ █▂▅ █▄ ▅█ │ Percent covered: 41.37%
> 49.83% │█ ██ ██ ██ ███ █▅ ███ ▁ ██ ██ │ Mean coverage: 0.963x
> 39.87% │█ ██ ▂ ██ ██ ███ ██ ███ █ ██ ██ │ Mean baseQ: 31.2
> 29.90% │█ ██ ▄ █▅███ ██▄ ███▄ ██ ███ █ ██ ██▇ │ Mean mapQ: 60
> 19.93% │█ ██ █▂ █████▁███▇████▇████▇███ █ ▇ ██ ███ │
> 9.97% │█ ▁██▃██ ▄███████████████████████▇▁█ █ ▅ ██ ███ │ Histo bin width: 598bp
> 0.00% │█ ██████ ███████████████████████████▂█ █ ██ ███ │ Histo max bin: 99.666%
1 6.0K 12.0K 17.9K 23.9K 29.9K
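The two headline numbers in that summary (percent covered and mean coverage) come straight from per-base depths. A minimal sketch of the same arithmetic, assuming you already have a list of depth values across the genome:

```python
def coverage_stats(depths):
    """Compute percent-covered and mean coverage from per-base depths,
    the same two summary numbers samtools coverage reports."""
    genome_len = len(depths)
    covered = sum(1 for d in depths if d > 0)
    return {
        "percent_covered": 100.0 * covered / genome_len,
        "mean_coverage": sum(depths) / genome_len,
    }

# e.g. a toy genome where half the bases have depth 2 and half have depth 0:
stats = coverage_stats([2] * 50 + [0] * 50)
# percent_covered == 50.0, mean_coverage == 1.0
```

Note that mean coverage can look respectable even when coverage is patchy, which is why both numbers matter; here both are far below the 100X variant-calling threshold.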
I think there are fewer than 3500 reads in that fastq file.
Good news, though, most of the reads were SARS-CoV-2. Here's the Kraken2 report:
$ cat cecret/kraken2/2023CV2184_kraken2_report.txt
10.32 331 331 U 0 unclassified
89.68 2877 0 R 1 root
89.68 2877 0 D 10239 Viruses
89.65 2876 0 D1 2559587 Riboviria
89.65 2876 0 O 76804 Nidovirales
89.65 2876 0 O1 2499399 Cornidovirineae
89.65 2876 1 F 11118 Coronaviridae
89.62 2875 0 F1 2501931 Orthocoronavirinae
89.62 2875 0 G 694002 Betacoronavirus
89.62 2875 0 G1 2509511 Sarbecovirus
89.62 2875 4 S 694009 Severe acute respiratory syndrome-related coronavirus
89.50 2871 2871 S1 2697049 Severe acute respiratory syndrome coronavirus 2
0.03 1 0 O 28883 Caudovirales
0.03 1 0 F 10662 Myoviridae
0.03 1 1 G 186777 Muvirus
What's the read count in your other files?
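Counting reads in a (possibly gzipped) FASTQ is just the line count divided by four; a quick sketch:

```python
import gzip

def count_reads(path):
    """FASTQ records are exactly 4 lines each, so read count = line count // 4."""
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "rt") as fh:
        return sum(1 for _ in fh) // 4
```

(The shell equivalent is zcat file.fastq.gz | wc -l divided by 4.)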
Actually, I just looked at the reads. These are really long. Are you sure they are paired-end Illumina reads?
When I treated these files as nanopore instead of Illumina, I was able to get more bases called. (Nanopore uses artic instead of ivar)
Are these from a newfangled Illumina long-read sequencer?
I'll check. Meanwhile I have removed the gzipped fastqs < 10M from the comparisons. I will use --nanopore for the reads.
Most are ClearLabs with CL Modified Midnight Primers V3, 4 are MiSeq with Artic v5.3, and 4 are NextSeq with Artic v4.1.
That sounds like an awesome dataset!
Sorry to repeat myself from the conversation on SLACK, but someone else may have this same issue and I'd like for them to know how to get it resolved.
Nanopore and Illumina files can be read into the same run, but multiple primer bedfiles cannot. For the best results, you'll want to run Cecret on each set of samples divided by primer set.
The midnight primer set isn't included in this workflow as of today, but now that we've added nanopore functionality it is likely to become more common. I'll add the midnight primer set to my todo list for a future release.
What primer set is best to use with the ClearLabs sequences for the time-being?
I used artic_V3 with the following command-line, but it failed:
nextflow run UPHL-BioNGS/Cecret -profile singularity,artic_V3 -c configs/sarscov.config --medcpus 70 --maxcpus 100 --nanopore nanopore
This is the failed command:
medaka consensus --model r941_min_high_g360 --threads 100 --chunk_len 800 --chunk_ovlp 400 --RG nCoV-2019_1 artic/2023CV2249.trimmed.rg.sorted.bam artic/2023CV2249.nCoV-2019_1.hdf
I found this web page: https://labs.epi2me.io/ont-midnight-scheme-update
excerpt: The rule of thumb for Midnight kits bought from ONT is that the kit version will always pair with the scheme version e.g. EXP-MRT001.30 should be paired with primer scheme Midnight-ONT/V3.
A table follows showing the additions for V3.
Here is a link to what appears to be the new 6-column bed file: https://github.com/artic-network/primer-schemes/blob/master/nCoV-2019/V3/nCoV-2019.primer.bed, which you already have in schema.
This is the failed command:
medaka consensus --model r941_min_high_g360 --threads 100 --chunk_len 800 --chunk_ovlp 400 --RG nCoV-2019_1 artic/2023CV2249.trimmed.rg.sorted.bam artic/2023CV2249.nCoV-2019_1.hdf
My apologies, but I need more information about this error.
What inputs did you use with the workflow? Could you share your configs/sarscov.config file? Could you share the full error? (It's possible that the container is out-of-date and more models need to be added to it.)
Also, the link you shared is for the artic V3 primers, which are different than the midnight V3 primers. I am unsure of the full implications and problems of using the "wrong" primer set, so I recommend using the primers that were used to generate the reads.
I reran Cecret on just the ONT reads and didn't specify the primer set:
nextflow run UPHL-BioNGS/Cecret -profile singularity -c configs/sarscov.config --medcpus 70 --maxcpus 100 --nanopore nanopore
I have attached nextflow.log and sarscov.config. I can't seem to find the ONT Midnight Primers V3 bed file anywhere. Let me know if you need anything else.
Attached is an ONT modified Midnight Primers V3 bed that I generated from the scheme file. I will try to use it in my command-line. CL_Modified_Midnight_Primers_SARS-CoV-2.primer.bed.zip
I incorporated the new bedfiles as follows in my CLI, but Cecret errored out (log attached):
nextflow run UPHL-BioNGS/Cecret -profile singularity,sarscov2 -c configs/sarscov.config --medcpus 70 --maxcpus 100 --nanopore nanopore --primer_bed configs/CL_Modified_Midnight_Primers_SARS-CoV-2.primer.bed --insert_bed configs/CL_Modified_Midnight_Primers_SARS-CoV-2.insert.bed
I think I may have a solution, but I'm also going to check the biocontainer's artic version (1.2.3--pyhdfd78af_0) to see if that one would be best to use (artic is currently on version 1.2.4, but 1.2.3 is likely fine)
The nextflow.log shows the entire error. I have attached sarscov.config: sarscov.config.txt
Can you test the new version? I'm hopeful that this container works better in the cloud.
Sure. I'll grab it now. Thanks!
nextflow pull UPHL-BioNGS/Cecret
Checking UPHL-BioNGS/Cecret ...
UPHL-BioNGS/Cecret contains uncommitted changes -- cannot pull from repository
It didn’t work because I am out of space on the drive.
If nextflow pull doesn't work, you probably need to remove the directory and re-pull it.
rm -rf ~/.nextflow/assets/UPHL-BioNGS/Cecret
As for the space, you might need to remove the 'work' directory and empty your '/tmp' directory. Sometimes tmp files aren't deleted when a nextflow workflow fails.
Still running out of space. We are looking into directories to delete….
I have a thought. What version of singularity are you using?
singularity version 3.8.7-1.el7
The nanopore reads ran fine with the new version of Cecret.
Oh good! I've been so worried!!!!
I am running the latest version of Cecret on paired-end fastqs as follows:
nextflow run UPHL-BioNGS/Cecret -profile singularity -c configs/sarscov.config --medcpus 70 --maxcpus 100 --reads reads --relatedness true --msa nextalign
The reads files all end with _R{1,2}.fastq. A typical pair is 2023CV1635.BC4L11.2023-06-29.01.C03_R1.fastq and 2023CV1635.BC4L11.2023-06-29.01.C03_R2.fastq.
They were derived from an interleaved fastq file (2023CV1635.BC4L11.2023-06-29.01.C03.fastq.gz) after unzipping using the following:
fastaq deinterleave 2023CV1635.BC4L11.2023-06-29.01.C03.fastq 2023CV1635.BC4L11.2023-06-29.01.C03_R1.fastq 2023CV1635.BC4L11.2023-06-29.01.C03_R2.fastq
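For reference, deinterleaving just alternates 4-line records between two output files. A minimal pure-Python equivalent (a sketch, not the fastaq implementation) makes the hidden assumption obvious: the input must genuinely alternate R1, R2, R1, R2. Run on a file of unpaired single reads, it will happily emit two bogus "mate" files whose read names don't match, which is consistent with the bwa "paired reads have different names" error.

```python
def deinterleave(interleaved_path, r1_path, r2_path):
    """Split an interleaved FASTQ: records are assumed to alternate
    R1, R2, R1, R2, ... and each record is exactly 4 lines."""
    with open(interleaved_path) as src, \
         open(r1_path, "w") as r1, open(r2_path, "w") as r2:
        outputs = (r1, r2)
        record, n = [], 0
        for line in src:
            record.append(line)
            if len(record) == 4:
                outputs[n % 2].writelines(record)  # even records -> R1, odd -> R2
                record = []
                n += 1
```

A sanity check after deinterleaving is to confirm record N of R1 and record N of R2 share a read name.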
Here is the error:

ERROR ~ Error executing process > 'CECRET:cecret:bwa (2023CV1635.BC4L11.2023-06-29.01.C03)'

Caused by:
  Process `CECRET:cecret:bwa (2023CV1635.BC4L11.2023-06-29.01.C03)` terminated with an error exit status (1)

Command executed:

  mkdir -p bwa logs/CECRET:cecret:bwa
  log=logs/CECRET:cecret:bwa/2023CV1635.BC4L11.2023-06-29.01.C03.2e441589-1655-44d3-8de3-d34884116268.log

  # time stamp + capturing tool versions
  date > $log
  echo "bwa $(bwa 2>&1 | grep Version )" >> $log
  bwa_version="bwa : "$(bwa 2>&1 | grep Version)

  # index the reference fasta file
  bwa index MN908947.3.fasta

  # bwa mem command
  bwa mem -t 100 MN908947.3.fasta 2023CV1635.BC4L11.2023-06-29.01.C03_clean_PE1.fastq.gz 2023CV1635.BC4L11.2023-06-29.01.C03_clean_PE2.fastq.gz > bwa/2023CV1635.BC4L11.2023-06-29.01.C03.sam
Command exit status: 1
Command output: (empty)
Command error:
  [bwa_index] Pack FASTA... 0.00 sec
  [bwa_index] Construct BWT for the packed sequence...
  [bwa_index] 0.01 seconds elapse.
  [bwa_index] Update BWT... 0.00 sec
  [bwa_index] Pack forward-only FASTA... 0.00 sec
  [bwa_index] Construct SA from BWT and Occ... 0.00 sec
  [main] Version: 0.7.17-r1188
  [main] CMD: bwa index MN908947.3.fasta
  [main] Real time: 0.028 sec; CPU: 0.017 sec
  [M::bwa_idx_load_from_disk] read 0 ALT contigs
  [M::process] read 3830 sequences (550465 bp)...
  [M::mem_pestat] # candidate unique pairs for (FF, FR, RF, RR): (505, 0, 0, 501)
  [M::mem_pestat] analyzing insert size distribution for orientation FF...
  [M::mem_pestat] (25, 50, 75) percentile: (709, 3832, 7080)
  [M::mem_pestat] low and high boundaries for computing mean and std.dev: (1, 19822)
  [M::mem_pestat] mean and std.dev: (4089.34, 3196.52)
  [M::mem_pestat] low and high boundaries for proper pairs: (1, 26193)
  [M::mem_pestat] skip orientation FR as there are not enough pairs
  [M::mem_pestat] skip orientation RF as there are not enough pairs
  [M::mem_pestat] analyzing insert size distribution for orientation RR...
  [M::mem_pestat] (25, 50, 75) percentile: (1059, 3892, 7015)
  [M::mem_pestat] low and high boundaries for computing mean and std.dev: (1, 18927)
  [M::mem_pestat] mean and std.dev: (4197.31, 3100.22)
  [M::mem_pestat] low and high boundaries for proper pairs: (1, 24883)
  [mem_sam_pe] paired reads have different names: "00056560-4cc8-4869-a2f5-855b469d586d", "001c0b61-b674-4486-81fa-7ffac2957da1"
Work dir: /data/Sequence_analysis/Cecret/Analyses/Covid_Terra_6Sep23/work/33/d5997c0f46ce6f7cdd83b12ec362af
=== NOTE: The same error occurred when I did not deinterleave the files first!