Closed DrB-S closed 1 year ago
The sed command should have a (*) after each dot, but it didn't render that way above!!
Could you share your command and config file (if applicable)?
Sure Command: nextflow run UPHL-BioNGS/Cecret -profile singularity,mpx -c configs/monkeypox.config --outdir 2Aug_removed26_27_plusTower
Cecret config file: monkeypox.config.txt
Since you are defining species in your config file, there's no need to use the mpx profile.
nextflow run UPHL-BioNGS/Cecret -profile singularity -c configs/monkeypox.config --outdir 2Aug_removed26_27_plusTower
I've removed all the commented out lines in your file for my sake.
tower.accessToken = '<removed>'
tower.enabled = true
params.reads = 'reads'
params.fastas = 'fastas'
params.outdir = 'cecret'
params.maxcpus = 8
params.medcpus = 4
params.reference_genome = "/data/Sequence_analysis/Cecret/Analyses/monkeypox/references/NC_063383.1.fasta"
params.gff_file = "/data/Sequence_analysis/Cecret/Analysis/monkeypox/references/NC_063383.1.gff3"
params.species = 'mpx'
params.cleaner = 'fastp'
params.aligner = 'minimap2'
params.msa = 'nextalign'
fastqc_container = 'staphb/fastqc:0.11.9'
seqyclean_container = 'staphb/seqyclean:1.10.09'
fastp_container = 'staphb/fastp:0.23.2'
bwa_container = 'staphb/bwa:0.7.17'
minimap2_container = 'staphb/minimap2:2.24'
samtools_container = 'staphb/samtools:1.15'
ivar_container = 'staphb/ivar:1.3.1'
bcftools_container = 'staphb/bcftools:1.15'
kraken2_container = 'staphb/kraken2:2.1.2-no-db'
bedtools_container = 'staphb/bedtools:2.30.0'
pangolin_container = 'staphb/pangolin:latest'
nextclade_container = 'nextstrain/nextclade:2.3.0'
vadr_container = 'staphb/vadr:latest'
parallel_perl_container = 'staphb/parallel-perl:20200722'
mafft_container = 'staphb/mafft:7.475'
nextalign_container = 'nextstrain/nextalign:latest'
snp_dists_container = 'staphb/snp-dists:0.8.2'
iqtree2_container = 'staphb/iqtree2:2.1.2'
pandas_container = 'quay.io/biocontainers/pandas:1.1.5'
multiqc_container = 'quay.io/biocontainers/multiqc:1.12--pyhdfd78af_0'
freyja_container = 'staphb/freyja:1.3.4'
params.cleaner = 'fastp'
params.fastp_options = ''
params.aligner = 'minimap2'
params.minimap2_options = '-K 20M'
params.filter = true
params.trimmer = 'none'
params.ivar_variants = true
params.ivar_consensus = true
params.minimum_depth = 10
params.mpileup_depth = 8000
params.ivar_consensus_options = '-q 20 -t 0.6 -n N'
params.bcftools_variants = true
params.markdup = 'true'
params.kraken2_options = ''
params.kraken2 = true
params.kraken2_db = '/data/Sequence_analysis/databases/kraken2_db/kraken2_PlusPF-16'
params.kraken2_organism = "Monkeypox virus"
params.bedtools_multicov = false
params.samtools_ampliconstats = false
params.samtools_plot_ampliconstats = false
params.pangolin = false
params.freya = false
params.freyja_aggregate = false
params.nextclade_dataset = 'hMPXV_B1'
params.nextclade_dataset_tag = '2022-06-29T12:00:00'
params.nextclade = true
params.vadr_options = '--split --cpu 20 --glsearch -s -r --nomisc --mkey mpxv --r_lowsimok --r_lowsimxd 100 --r_lowsimxl 2000 --alt_pass discontn,dupregin --mdir /usr/local/vadr-models-mpxv-1.4.2-1 mpxv.3.fa my3'
params.vadr_reference = 'monkeypox'
params.vadr_trim_options = '--minlen 50 --maxlen 30000'
params.vadr = true
params.msa = 'nextalign'
params.nextalign_options = '--genes OPG001,OPG002,OPG003,OPG005,OPG015,OPG016,OPG019,OPG021,OPG022,OPG023,OPG024,OPG025,OPG027,OPG029,OPG030,OPG031,OPG034,OPG035,OPG036,OPG037,OPG038,OPG039,OPG040,OPG042,OPG043,OPG044,OPG045,OPG046,OPG047,OPG048,OPG049,OPG050,OPG051,OPG052,OPG053,OPG054,OPG055,OPG056,OPG057,OPG058,OPG059,OPG060,OPG061,OPG062,OPG063,OPG064,OPG065,OPG066,OPG068,OPG069,OPG070,OPG071,OPG072,OPG073,OPG074,OPG075,OPG076,OPG077,OPG078,OPG079,OPG080,OPG081,OPG082,OPG083,OPG084,OPG085,OPG086,OPG087,OPG088,OPG089,OPG090,OPG091,OPG092,OPG093,OPG094,OPG095,OPG096,OPG097,OPG098,OPG099,OPG100,OPG101,OPG102,OPG103,OPG104,OPG105,OPG106,OPG107,OPG108,OPG109,OPG110,OPG111,OPG112,OPG113,OPG114,OPG115,OPG116,OPG117,OPG118,OPG119,OPG120,OPG121,OPG122,OPG123,OPG124,OPG125,OPG126,OPG127,OPG128,OPG129,OPG130,OPG131,OPG132,OPG133,OPG134,OPG135,OPG136,OPG137,OPG138,OPG139,OPG140,OPG141,OPG142,OPG143,OPG144,OPG145,OPG146,OPG147,OPG148,OPG149,OPG150,OPG151,OPG153,OPG154,OPG155,OPG156,OPG157,OPG158,OPG159,OPG160,OPG161,OPG162,OPG163,OPG164,OPG165,OPG166,OPG167,OPG170,OPG171,OPG172,OPG173,OPG174,OPG175,OPG176,OPG178,OPG180,OPG181,OPG185,OPG187,OPG188,OPG189,OPG190,OPG191,OPG192,OPG193,OPG195,OPG197,OPG198,OPG199,OPG200,OPG204,OPG205,OPG208,OPG209,OPG210 --include-reference'
params.relatedness = true
params.snpdists_options = '-c'
params.snpdists = true
params.iqtree2_options = '-ninit 2 -n 2 -me 0.05 -m GTR -o NC_063383.1'
params.iqtree2 = false
process {
publishDir = [ path: params.outdir, mode: 'copy' ]
errorStrategy = 'retry'
maxRetries = 1
withLabel: maxcpus {
cpus = params.maxcpus
}
withLabel: medcpus {
cpus = params.medcpus
}
withName:fastqc{
container = fastqc_container
}
withName:seqyclean{
container = seqyclean_container
}
withName:fastp{
container = fastp_container
}
withName:bwa{
container = bwa_container
}
withName:minimap2{
container = minimap2_container
}
withName:'ivar_.*'{
container = ivar_container
}
withName:ivar_consensus{
memory = '8 GB'
container = ivar_container
}
withName:fasta_prep{
container = pandas_container
}
withName:bcftools_variants{
container = bcftools_container
}
withName:kraken2{
container = kraken2_container
}
withName:bedtools_multicov{
container = bedtools_container
}
withName:'samtools_.*'{
container = samtools_container
}
withName:pangolin{
container = pangolin_container
}
withName:'freyja.*'{
container = freyja_container
}
withName:nextclade{
container = nextclade_container
}
withName:vadr{
container = 'staphb/vadr:latest'
errorStrategy = 'ignore'
}
withName:summary{
container = pandas_container
}
withName:combine_results{
container = pandas_container
}
withName:mafft{
container = mafft_container
errorStrategy = 'retry'
maxRetries = 2
}
withName:nextalign{
container = nextalign_container
}
withName:snpdists{
container = snp_dists_container
}
withName:iqtree2{
container = iqtree2_container
}
withName:'multiqc.*'{
container = multiqc_container
}
}
Thanks!
Here are my observations
params.iqtree2 = false
, so iqtree2 shouldn't be running at all. I need to test this.
params.msa = 'nextalign'
, params.cleaner = 'fastp'
are defined twice. They only need to be defined once.
params.iqtree2_options = '-ninit 2 -n 2 -me 0.05 -m GTR -o NC_063383.1'
will fail if the following is not also changed: params.iqtree2_outgroup = ''
params.nextclade_dataset = 'hMPXV_B1'
is different than the default, and will have a different outgroup. You'll need to adjust your outgroup to this dataset.
params.nextclade_dataset_tag = '2022-06-29T12:00:00'
is not used
params.vadr_options = '--split --cpu 20 --glsearch -s -r --nomisc --mkey mpxv --r_lowsimok --r_lowsimxd 100 --r_lowsimxl 2000 --alt_pass discontn,dupregin --mdir /usr/local/vadr-models-mpxv-1.4.2-1 mpxv.3.fa my3'
will probably fail because cpu is getting defined twice, among other things. You'll actually want to change the process options below
withName:vadr{
container = 'staphb/vadr:latest'
errorStrategy = 'ignore'
cpus = 20
}
if you want use more cpus for this process.
These are the recommended paramaters for using vadr with mpx
params.vadr_options = '--split --glsearch -s -r --nomisc --r_lowsimok --r_lowsimxd 100 --r_lowsimxl 2000 --alt_pass discontn,dupregin'
params.vadr_reference = 'mpxv'
params.vadr_trim_options = '--minlen 50 --maxlen 30000'
will trim your sequences to 30,000 bases. This is fine for SARS-CoV-2, but Monkeypox is much larger. I recommend using params.vadr_trim_options = '--minlen 50 --maxlen 210000'
Also, what's wrong with nextclade 2.4?
On Aug 3, 2022, at 1:22 PM, Young @.***> wrote:
First thing, params.iqtree2 = false, so iqtree2 shouldn't be running at all. I need to test this.
I set params.iqtree2 to false to finish the run, since it was erroring out for iqtree2. Removing mpx from the command-line profile solved the problem and I have now set it to true.
params.msa = 'nextalign' , params.cleaner = 'fastp' are defined twice. They only need to be defined once.
I removed the 2nd definition for each.
params.nextclade_dataset = 'hMPXV_B1' is different than the default, and will have a different outgroup. You'll need to adjust your outgroup to this dataset.
I changed it back to ‘MPXV’. I had changed to hMPXV_B1 because it was mentioned in in the Theiagen Term office hours yesterday.
This parameter isn't used. params.nextclade_dataset_tag = '2022-06-29T12:00:00'
I removed the dataset tag
params.vadr_options = '--split --cpu 20 --glsearch -s -r --nomisc --mkey mpxv --r_lowsimok --r_lowsimxd 100 --r_lowsimxl 2000 --alt_pass discontn,dupregin --mdir /usr/local/vadr-models-mpxv-1.4.2-1 mpxv.3.fa my3' will probably fail because cpu is getting defined twice. You'll actually want to change the process options below
withName:vadr{ container = 'staphb/vadr:latest' errorStrategy = 'ignore' cpus = 20 }
I removed the vadr options and added cpus = 20 to withName:vadr
Thanks!
Stephen M. Beckstrom-Sternberg, PhD Bioinformatics Contractor
Arizona State Public Health Lab Arizona Department of Health Services Cell: (602) 653-5011 Email: @.***
— Reply to this email directly, view it on GitHub https://github.com/UPHL-BioNGS/Cecret/issues/113#issuecomment-1204436035, or unsubscribe https://github.com/notifications/unsubscribe-auth/AVTVLJRY7U3LABKDKHT4NJLVXLIJNANCNFSM55P3N7EA. You are receiving this because you authored the thread.
-- CONFIDENTIALITY NOTICE: This e-mail is the property of the Arizona Department of Health Services and contains information that may be PRIVILEGED, CONFIDENTIAL, or otherwise exempt from disclosure by applicable law. It is intended only for the person(s) to whom it is addressed. If you have received this communication in error, please do not retain or distribute it. Please notify the sender immediately by e-mail at the address shown above and delete the original message. Thank you.
I’ll try changing it back to nextstrain/nextclade:latest
Stephen M. Beckstrom-Sternberg, PhD Bioinformatics Contractor
Arizona State Public Health Lab Arizona Department of Health Services Cell: (602) 653-5011 Email: @.***
On Aug 3, 2022, at 1:34 PM, Young @.***> wrote:
Also, what's wrong with nextclade 2.4?
— Reply to this email directly, view it on GitHub https://github.com/UPHL-BioNGS/Cecret/issues/113#issuecomment-1204454146, or unsubscribe https://github.com/notifications/unsubscribe-auth/AVTVLJWXKCQ2YKXYQKNGM2LVXLJUPANCNFSM55P3N7EA. You are receiving this because you authored the thread.
-- CONFIDENTIALITY NOTICE: This e-mail is the property of the Arizona Department of Health Services and contains information that may be PRIVILEGED, CONFIDENTIAL, or otherwise exempt from disclosure by applicable law. It is intended only for the person(s) to whom it is addressed. If you have received this communication in error, please do not retain or distribute it. Please notify the sender immediately by e-mail at the address shown above and delete the original message. Thank you.
We are still getting the same iqtree2 failure. Header is still ">NC_063383 ref_in_coord Reference sequence in coord.fasta coordinates" Instead of ">NC_063383"
I think I have found where the iqtree2 reference header problem is coming from: modules/iqtree2.nf, line 27 cat !{msa} | sed 's/!{params.iqtree2_outgroup}.*/!{params.iqtree2_outgroup}/g' > !{msa}.renamed
The nextalign.aligned.fasta header is: nextalign.aligned.fasta header
ref_in_coord Reference sequence in coord.fasta coordinates
We want to modify it to this:
NC_063383.1
So the sed command should be:
cat !{msa} | sed 's/ref_in_coord..*$/!{params.iqtree2_outgroup}/g' > !{msa}.renamed
The header for using the 'hMPXV' dataset from nexclade is >NC_063383.1 Monkeypox virus, complete genome
When params.iqtree2_outgroup = 'NC_063383'
, the following command
cat !{msa} | sed 's/!{params.iqtree2_outgroup}.*/!{params.iqtree2_outgroup}/g' > !{msa}.renamed
becomes
cat !{msa} | sed 's/NC_063383.*/NC_063383/g' > !{msa}.renamed
This turns
>NC_063383.1 Monkeypox virus, complete genome
into >NC_063383
This command should also work for ">NC_063383 ref_in_coord Reference sequence in coord.fasta coordinates"
$ echo ">NC_063383 ref_in_coord Reference sequence in coord.fasta coordinates" | sed 's/NC_063383.*/NC_063383/g'
>NC_063383
What are your 'params.iqtree2_outgroup' and 'params.iqtree2_options' values?
params.iqtree2_outgroup is incorporated into params.iqtree2_options
params.iqtree2_options = '-ninit 2 -n 2 -me 0.05 -m GTR -o NC_063383.1'
Do I need to remove the outgroup (-o NC_063383.1) from params.iqtree2_options and instead use params.iqtree2_outgroup?
And do I need to use
params.nextclade_dataset = 'hMPXV' rather than params.nextclade_dataset = 'MPXV'
Right now, if 'params.iqtree2_outgroup' is empty, nothing happens to the header.
I've found IQTREE2 is a bit finicky about having spaces in the outgroup (or WAS, maybe it's been fixed now), which is why I added in the sed
command to fix nextclade's reference.
Do I need to remove the outgroup (-o NC_063383.1) from params.iqtree2_options and instead use params.iqtree2_outgroup?
I recommend this.
params.iqtree2_options = '-ninit 2 -n 2 -me 0.05 -m GTR'
params.iqtree2_outgroup = 'NC_063383.1'
And do I need to use
params.nextclade_dataset = 'hMPXV' rather than params.nextclade_dataset = 'MPXV'
The reference may be different depending on what dataset you use from nextclade, but I think both of these use 'NC_063383.1'
iqtree2 is still failing. We are still getting the wrong header in nextalign.aligned.fasta & nextalign.aligned.fasta.renamed
">ref_in_coord Reference sequence in coord.fasta coordinates"
I have now looked at this closer. I assumed that nextclade would keep the accession number in their reference fasta.
The reference file for 'MPXV' is 'ref_in_coord', so you'll want to set your outgroup to
params.iqtree2_outgroup = 'ref_in_coord'
Or you could set it to params.iqtree2_outgroup = ''
to avoid the outgroup option altogether.
So it will be included in the tree as ‘ref_in_coord’? That is the way it shows up in the snp matrix.
If I avoid the outgroup option, will it still show up in the tree and matrix?
Stephen M. Beckstrom-Sternberg, PhD Bioinformatics Contractor
Arizona State Public Health Lab Arizona Department of Health Services Cell: (602) 653-5011 Email: @.***
On Aug 5, 2022, at 1:24 PM, Young @.***> wrote:
I have now looked at this closer. I assumed that nextclade would keep the accession number in their reference fasta.
The reference file for 'MPXV' is 'ref_in_coord', so you'll want to set your outgroup to params.iqtree2_outgroup = 'ref_in_coord'
Or you could set it to params.iqtree2_outgroup = '' to avoid the outgroup option altogether.
— Reply to this email directly, view it on GitHub https://github.com/UPHL-BioNGS/Cecret/issues/113#issuecomment-1206834744, or unsubscribe https://github.com/notifications/unsubscribe-auth/AVTVLJT6OBPYXPYSFSW52CTVXV2BNANCNFSM55P3N7EA. You are receiving this because you authored the thread.
-- CONFIDENTIALITY NOTICE: This e-mail is the property of the Arizona Department of Health Services and contains information that may be PRIVILEGED, CONFIDENTIAL, or otherwise exempt from disclosure by applicable law. It is intended only for the person(s) to whom it is addressed. If you have received this communication in error, please do not retain or distribute it. Please notify the sender immediately by e-mail at the address shown above and delete the original message. Thank you.
If you leave out the outgroup, it will still show up in the tree.
I was able to get Cecret to finish, including iqtree2, by setting the following in configs/cecret.config (this is the working version of dev/monkeypox_steve.config):
//#Removed 5 Aug 22 as per Erin - params.iqtree2_outgroup = 'NC_063383.1' params.iqtree2_outgroup = ''
Commented out the species flag //params.species = 'mpx' //#Turn this on if not using mpx flag
Included the mpx profile on the command-line
(The alternative is to uncomment the params.species line and NOT include the mpx flag on the command-line)
==
There is still erroneous output in the following: iqtree2 nextalign/nextalign.fasta snp_dists dataset/reference.fasta logs/msa:iqtree2/msa:iqtree2..
They all contain "ref_in_coord..*', rather than 'NC_063383.1'
These can be found by running the following command In the Analysis dir:
grep 'ref_in_coord' */*
Here are the errors encountered: ref_in_coord_ERRORS.txt
This is the final tree. It looks like it should. The ref_in_coord header is from the nextclade group.
iqtree2/iqtree2.iqtree:(ref_in_coord:0.0017018305,((Consensus_ERR3485802.consensus_threshold_0.6_quality_20:0.0000005071,Consensus_ERR3485801.consensus_threshold_0.6_quality_20:0.0000005071):0.0000384397,(((Consensus_ERR3485797.consensus_threshold_0.6_quality_20:0.0000005071,Consensus_ERR3485799.consensus_threshold_0.6_quality_20:0.0000005071):0.0000005071,Consensus_ERR3485800.consensus_threshold_0.6_quality_20:0.0000005071):0.0000005071,Consensus_ERR3485798.consensus_threshold_0.6_quality_20:0.0000005071):0.0000537884):0.0008604210,(MPXV-UK_P3_completeGenome:0.0000005071,(Consensus_SRR19536729.consensus_threshold_0.6_quality_20:0.0000005071,Consensus_SRR19536728.consensus_threshold_0.6_quality_20:0.0000005071):0.0002536239):0.0011492286);
I'll look at this again tomorrow.
I thought that nextclade would use 'NC_063383.1' for both hMPXV and MPXV, but it appears that the reference for MPXV is 'ref_in_coord'. There's not a lot that I can do about that.
I've put in an issue with the nextstrain team (https://github.com/nextstrain/nextclade_data/issues/35), but this is going to be something that they're going to have to fix.
As a work-around until they fix this at nextstrain, is there a way to stop the Cecret pipeline after it creates dataset/reference.fasta, change the header to '>NC_063383.1', and resume the pipeline?
I hesitate to do that because it doesn't look like the sequence is NC_063383.1. It looks like a predicted ancestral sequence with the most recent one being 3 days ago (https://github.com/nextstrain/nextclade_data/tree/master/data/datasets/MPXV/references/ancestral/versions).
Does this mean the ancestral sequence will continue changing? And will reference positions change as a result?
I wish I could answer that in a definitive manner.
Using the 'hMPXV' seems more stable.
Yes
To rerun then, I would change the following: params.nextclade_dataset = 'MPXV' ==> 'hMPXV'
And leave the following: params.vadr._refererence = 'mpxv' params.species = 'mpx'
I think so?
You could add the params.outgroup back in if you wanted to specify NC_063383.1 as the outgroup when creating a tree with iqtree2.
OK
Those changes worked!!
Victory!
Now that it is working, I am doing a couple of test runs keeping params.iqtree2_outgroup = 'NC_063383.1' to see if the reference genome name is still working:
Sorry I just saw this now. I think I can shed some light on some of your questions (by the way feel free to open issues on Nextclade/Nextclade_data it you aren't sure of something).
Every of the three datasets has a different reference sequence. They all have the same coordinate system (that of NC_068..., the new rivers reference).
Only hMPXV uses a real sequence that is in Genbank.
The other two datasets have that sequence altered by a number of SNPs.
B.1 uses the SNPs of the outbreak polytomy, in this case the MA001 sequence.
MPXV uses the SNPs of a reconstructed ancestor.
None of these reference sequences are changing with new releases. They are stable.
Since we don't really use the ID of the reference sequences for anything I had not chosen it well - sorry for that. I will think of a more expressive and less cumbersome name.
Thanks for that information!
iqtree2 fails because of a wrong header for the outgroup.
nextalign.aligned.fasta header is:
So nextalign.aligned.fasta.renamed header is:
Instead of this:
Solution is to fix sed command like this: sed 's/^.NC_063383./NC_063383/g'