UPHL-BioNGS / Cecret

Reference-based consensus creation
MIT License
44 stars 22 forks source link

iqtree2 failure: nextalign.aligned.fasta.renamed has wrong outgroup header for iqtree2 #113

Closed DrB-S closed 1 year ago

DrB-S commented 1 year ago

iqtree2 fails because of a wrong header for the outgroup.

outgroup="-o NC_063383"
cat nextalign.aligned.fasta | sed 's/NC_063383.*/NC_063383/g' > nextalign.aligned.fasta.renamed

nextalign.aligned.fasta header is:

MPXV_USA_2022_MA001 in NC_063383 coordinates

So nextalign.aligned.fasta.renamed header is:

MPXV_USA_2022_MA001 in NC_063383

Instead of this:

NC_63383

Solution is to fix sed command like this: sed 's/^.NC_063383./NC_063383/g'

DrB-S commented 1 year ago

The sed command should have a (*) after each dot, but it didn't render that way above!!

erinyoung commented 1 year ago

Could you share your command and config file (if applicable)?

DrB-S commented 1 year ago

Sure Command: nextflow run UPHL-BioNGS/Cecret -profile singularity,mpx -c configs/monkeypox.config --outdir 2Aug_removed26_27_plusTower

Cecret config file: monkeypox.config.txt

erinyoung commented 1 year ago

Since you are defining species in your config file, there's no need to use the mpx profile.

nextflow run UPHL-BioNGS/Cecret -profile singularity -c configs/monkeypox.config --outdir 2Aug_removed26_27_plusTower

I've removed all the commented out lines in your file for my sake.

tower.accessToken = '<removed>'
tower.enabled = true
params.reads = 'reads'
params.fastas = 'fastas'
params.outdir = 'cecret'
params.maxcpus = 8
params.medcpus = 4
params.reference_genome = "/data/Sequence_analysis/Cecret/Analyses/monkeypox/references/NC_063383.1.fasta"
params.gff_file = "/data/Sequence_analysis/Cecret/Analysis/monkeypox/references/NC_063383.1.gff3"
params.species = 'mpx'
params.cleaner = 'fastp'
params.aligner = 'minimap2'
params.msa = 'nextalign'
fastqc_container = 'staphb/fastqc:0.11.9'
seqyclean_container = 'staphb/seqyclean:1.10.09'
fastp_container = 'staphb/fastp:0.23.2'
bwa_container = 'staphb/bwa:0.7.17'
minimap2_container = 'staphb/minimap2:2.24'
samtools_container = 'staphb/samtools:1.15'
ivar_container = 'staphb/ivar:1.3.1'
bcftools_container = 'staphb/bcftools:1.15'
kraken2_container = 'staphb/kraken2:2.1.2-no-db'
bedtools_container = 'staphb/bedtools:2.30.0'
pangolin_container = 'staphb/pangolin:latest'
nextclade_container = 'nextstrain/nextclade:2.3.0'
vadr_container = 'staphb/vadr:latest'
parallel_perl_container = 'staphb/parallel-perl:20200722'
mafft_container = 'staphb/mafft:7.475'
nextalign_container = 'nextstrain/nextalign:latest'
snp_dists_container = 'staphb/snp-dists:0.8.2'
iqtree2_container = 'staphb/iqtree2:2.1.2'
pandas_container = 'quay.io/biocontainers/pandas:1.1.5'
multiqc_container = 'quay.io/biocontainers/multiqc:1.12--pyhdfd78af_0'
freyja_container = 'staphb/freyja:1.3.4'
params.cleaner = 'fastp'
params.fastp_options = ''
params.aligner = 'minimap2'
params.minimap2_options = '-K 20M'
params.filter = true
params.trimmer = 'none' 
params.ivar_variants = true
params.ivar_consensus = true
params.minimum_depth = 10
params.mpileup_depth = 8000
params.ivar_consensus_options = '-q 20 -t 0.6 -n N'
params.bcftools_variants = true
params.markdup = 'true'
params.kraken2_options = ''
params.kraken2 = true
params.kraken2_db = '/data/Sequence_analysis/databases/kraken2_db/kraken2_PlusPF-16'
params.kraken2_organism = "Monkeypox virus"
params.bedtools_multicov = false
params.samtools_ampliconstats = false
params.samtools_plot_ampliconstats = false
params.pangolin = false
params.freya = false
params.freyja_aggregate = false
params.nextclade_dataset = 'hMPXV_B1'
params.nextclade_dataset_tag = '2022-06-29T12:00:00'
params.nextclade = true
params.vadr_options = '--split --cpu 20 --glsearch -s -r --nomisc --mkey mpxv --r_lowsimok --r_lowsimxd 100 --r_lowsimxl 2000 --alt_pass discontn,dupregin --mdir /usr/local/vadr-models-mpxv-1.4.2-1 mpxv.3.fa my3'
params.vadr_reference = 'monkeypox'
params.vadr_trim_options = '--minlen 50 --maxlen 30000'
params.vadr = true
params.msa = 'nextalign' 
params.nextalign_options = '--genes OPG001,OPG002,OPG003,OPG005,OPG015,OPG016,OPG019,OPG021,OPG022,OPG023,OPG024,OPG025,OPG027,OPG029,OPG030,OPG031,OPG034,OPG035,OPG036,OPG037,OPG038,OPG039,OPG040,OPG042,OPG043,OPG044,OPG045,OPG046,OPG047,OPG048,OPG049,OPG050,OPG051,OPG052,OPG053,OPG054,OPG055,OPG056,OPG057,OPG058,OPG059,OPG060,OPG061,OPG062,OPG063,OPG064,OPG065,OPG066,OPG068,OPG069,OPG070,OPG071,OPG072,OPG073,OPG074,OPG075,OPG076,OPG077,OPG078,OPG079,OPG080,OPG081,OPG082,OPG083,OPG084,OPG085,OPG086,OPG087,OPG088,OPG089,OPG090,OPG091,OPG092,OPG093,OPG094,OPG095,OPG096,OPG097,OPG098,OPG099,OPG100,OPG101,OPG102,OPG103,OPG104,OPG105,OPG106,OPG107,OPG108,OPG109,OPG110,OPG111,OPG112,OPG113,OPG114,OPG115,OPG116,OPG117,OPG118,OPG119,OPG120,OPG121,OPG122,OPG123,OPG124,OPG125,OPG126,OPG127,OPG128,OPG129,OPG130,OPG131,OPG132,OPG133,OPG134,OPG135,OPG136,OPG137,OPG138,OPG139,OPG140,OPG141,OPG142,OPG143,OPG144,OPG145,OPG146,OPG147,OPG148,OPG149,OPG150,OPG151,OPG153,OPG154,OPG155,OPG156,OPG157,OPG158,OPG159,OPG160,OPG161,OPG162,OPG163,OPG164,OPG165,OPG166,OPG167,OPG170,OPG171,OPG172,OPG173,OPG174,OPG175,OPG176,OPG178,OPG180,OPG181,OPG185,OPG187,OPG188,OPG189,OPG190,OPG191,OPG192,OPG193,OPG195,OPG197,OPG198,OPG199,OPG200,OPG204,OPG205,OPG208,OPG209,OPG210 --include-reference'
params.relatedness = true
params.snpdists_options = '-c'
params.snpdists = true
params.iqtree2_options = '-ninit 2 -n 2 -me 0.05 -m GTR -o NC_063383.1'
params.iqtree2 = false

process {
  publishDir = [ path: params.outdir, mode: 'copy' ]

  errorStrategy = 'retry'
  maxRetries = 1

  withLabel: maxcpus {
    cpus = params.maxcpus
  }
  withLabel: medcpus {
    cpus = params.medcpus
  }

  withName:fastqc{
    container = fastqc_container
  }

  withName:seqyclean{
    container = seqyclean_container
  }

  withName:fastp{
    container = fastp_container
  }

  withName:bwa{
    container = bwa_container
  }

  withName:minimap2{
    container = minimap2_container
  }

  withName:'ivar_.*'{
    container = ivar_container
  }

  withName:ivar_consensus{
    memory = '8 GB'
    container = ivar_container
  }

  withName:fasta_prep{
    container = pandas_container
  }

  withName:bcftools_variants{
   container = bcftools_container
  }

  withName:kraken2{
    container = kraken2_container
  }

  withName:bedtools_multicov{
    container = bedtools_container
  }

  withName:'samtools_.*'{
    container = samtools_container
  }

  withName:pangolin{
    container = pangolin_container
  }

  withName:'freyja.*'{
    container = freyja_container
  }

  withName:nextclade{
    container = nextclade_container
  }

  withName:vadr{
    container = 'staphb/vadr:latest'
    errorStrategy = 'ignore'
  }

  withName:summary{
    container = pandas_container
  }

  withName:combine_results{
    container = pandas_container
  }

  withName:mafft{
    container = mafft_container
    errorStrategy = 'retry'
    maxRetries = 2
  }

  withName:nextalign{
    container = nextalign_container
  }

  withName:snpdists{
    container = snp_dists_container
  }

  withName:iqtree2{
   container = iqtree2_container
  }

  withName:'multiqc.*'{
   container = multiqc_container
  }
}
DrB-S commented 1 year ago

Thanks!

erinyoung commented 1 year ago

Here are my observations

  withName:vadr{
    container = 'staphb/vadr:latest'
    errorStrategy = 'ignore'
cpus = 20
  }

if you want use more cpus for this process.

These are the recommended paramaters for using vadr with mpx

params.vadr_options = '--split --glsearch -s -r --nomisc --r_lowsimok --r_lowsimxd 100 --r_lowsimxl 2000 --alt_pass discontn,dupregin'
params.vadr_reference = 'mpxv'
erinyoung commented 1 year ago

Also, what's wrong with nextclade 2.4?

DrB-S commented 1 year ago

On Aug 3, 2022, at 1:22 PM, Young @.***> wrote:

First thing, params.iqtree2 = false, so iqtree2 shouldn't be running at all. I need to test this.

I set params.iqtree2 to false to finish the run, since it was erroring out for iqtree2. Removing mpx from the command-line profile solved the problem and I have now set it to true.

params.msa = 'nextalign' , params.cleaner = 'fastp' are defined twice. They only need to be defined once.

I removed the 2nd definition for each.

params.nextclade_dataset = 'hMPXV_B1' is different than the default, and will have a different outgroup. You'll need to adjust your outgroup to this dataset.

I changed it back to ‘MPXV’. I had changed to hMPXV_B1 because it was mentioned in in the Theiagen Term office hours yesterday.

This parameter isn't used. params.nextclade_dataset_tag = '2022-06-29T12:00:00'

I removed the dataset tag

params.vadr_options = '--split --cpu 20 --glsearch -s -r --nomisc --mkey mpxv --r_lowsimok --r_lowsimxd 100 --r_lowsimxl 2000 --alt_pass discontn,dupregin --mdir /usr/local/vadr-models-mpxv-1.4.2-1 mpxv.3.fa my3' will probably fail because cpu is getting defined twice. You'll actually want to change the process options below

withName:vadr{ container = 'staphb/vadr:latest' errorStrategy = 'ignore' cpus = 20 }

I removed the vadr options and added cpus = 20 to withName:vadr

Thanks!

Stephen M. Beckstrom-Sternberg, PhD Bioinformatics Contractor

Arizona State Public Health Lab Arizona Department of Health Services Cell: (602) 653-5011 Email: @.***

— Reply to this email directly, view it on GitHub https://github.com/UPHL-BioNGS/Cecret/issues/113#issuecomment-1204436035, or unsubscribe https://github.com/notifications/unsubscribe-auth/AVTVLJRY7U3LABKDKHT4NJLVXLIJNANCNFSM55P3N7EA. You are receiving this because you authored the thread.

-- CONFIDENTIALITY NOTICE:  This e-mail is the property of the Arizona Department of Health Services and contains information that may be PRIVILEGED, CONFIDENTIAL, or otherwise exempt from disclosure by applicable law.  It is intended only for the person(s) to whom it is addressed.  If you have received this communication in error, please do not retain or distribute it.  Please notify the sender immediately by e-mail at the address shown above and delete the original message.  Thank you.  

DrB-S commented 1 year ago

I’ll try changing it back to nextstrain/nextclade:latest

Stephen M. Beckstrom-Sternberg, PhD Bioinformatics Contractor

Arizona State Public Health Lab Arizona Department of Health Services Cell: (602) 653-5011 Email: @.***

On Aug 3, 2022, at 1:34 PM, Young @.***> wrote:

Also, what's wrong with nextclade 2.4?

— Reply to this email directly, view it on GitHub https://github.com/UPHL-BioNGS/Cecret/issues/113#issuecomment-1204454146, or unsubscribe https://github.com/notifications/unsubscribe-auth/AVTVLJWXKCQ2YKXYQKNGM2LVXLJUPANCNFSM55P3N7EA. You are receiving this because you authored the thread.

-- CONFIDENTIALITY NOTICE:  This e-mail is the property of the Arizona Department of Health Services and contains information that may be PRIVILEGED, CONFIDENTIAL, or otherwise exempt from disclosure by applicable law.  It is intended only for the person(s) to whom it is addressed.  If you have received this communication in error, please do not retain or distribute it.  Please notify the sender immediately by e-mail at the address shown above and delete the original message.  Thank you.  

DrB-S commented 1 year ago

We are still getting the same iqtree2 failure. Header is still ">NC_063383 ref_in_coord Reference sequence in coord.fasta coordinates" Instead of ">NC_063383"

DrB-S commented 1 year ago

I think I have found where the iqtree2 reference header problem is coming from: modules/iqtree2.nf, line 27 cat !{msa} | sed 's/!{params.iqtree2_outgroup}.*/!{params.iqtree2_outgroup}/g' > !{msa}.renamed

The nextalign.aligned.fasta header is: nextalign.aligned.fasta header

ref_in_coord Reference sequence in coord.fasta coordinates

We want to modify it to this:

NC_063383.1

So the sed command should be:

  cat !{msa} | sed 's/ref_in_coord..*$/!{params.iqtree2_outgroup}/g' > !{msa}.renamed
erinyoung commented 1 year ago

The header for using the 'hMPXV' dataset from nexclade is >NC_063383.1 Monkeypox virus, complete genome

When params.iqtree2_outgroup = 'NC_063383', the following command

cat !{msa} | sed 's/!{params.iqtree2_outgroup}.*/!{params.iqtree2_outgroup}/g' > !{msa}.renamed

becomes

cat !{msa} | sed 's/NC_063383.*/NC_063383/g' > !{msa}.renamed

This turns >NC_063383.1 Monkeypox virus, complete genome into >NC_063383

This command should also work for ">NC_063383 ref_in_coord Reference sequence in coord.fasta coordinates"

$ echo ">NC_063383 ref_in_coord Reference sequence in coord.fasta coordinates" | sed 's/NC_063383.*/NC_063383/g' 
>NC_063383

What are your 'params.iqtree2_outgroup' and 'params.iqtree2_options' values?

DrB-S commented 1 year ago

params.iqtree2_outgroup is incorporated into params.iqtree2_options

params.iqtree2_options = '-ninit 2 -n 2 -me 0.05 -m GTR -o NC_063383.1'

Do I need to remove the outgroup (-o NC_063383.1) from params.iqtree2_options and instead use params.iqtree2_outgroup?

DrB-S commented 1 year ago

And do I need to use

params.nextclade_dataset = 'hMPXV' rather than params.nextclade_dataset = 'MPXV'

erinyoung commented 1 year ago

Right now, if 'params.iqtree2_outgroup' is empty, nothing happens to the header.

I've found IQTREE2 is a bit finicky about having spaces in the outgroup (or WAS, maybe it's been fixed now), which is why I added in the sed command to fix nextclade's reference.

Do I need to remove the outgroup (-o NC_063383.1) from params.iqtree2_options and instead use params.iqtree2_outgroup?

I recommend this.

params.iqtree2_options = '-ninit 2 -n 2 -me 0.05 -m GTR'
params.iqtree2_outgroup = 'NC_063383.1'
erinyoung commented 1 year ago

And do I need to use

params.nextclade_dataset = 'hMPXV' rather than params.nextclade_dataset = 'MPXV'

The reference may be different depending on what dataset you use from nextclade, but I think both of these use 'NC_063383.1'

DrB-S commented 1 year ago

iqtree2 is still failing. We are still getting the wrong header in nextalign.aligned.fasta & nextalign.aligned.fasta.renamed

">ref_in_coord Reference sequence in coord.fasta coordinates"

erinyoung commented 1 year ago

I have now looked at this closer. I assumed that nextclade would keep the accession number in their reference fasta.

The reference file for 'MPXV' is 'ref_in_coord', so you'll want to set your outgroup to params.iqtree2_outgroup = 'ref_in_coord'

Or you could set it to params.iqtree2_outgroup = '' to avoid the outgroup option altogether.

DrB-S commented 1 year ago

So it will be included in the tree as ‘ref_in_coord’? That is the way it shows up in the snp matrix.

If I avoid the outgroup option, will it still show up in the tree and matrix?

Stephen M. Beckstrom-Sternberg, PhD Bioinformatics Contractor

Arizona State Public Health Lab Arizona Department of Health Services Cell: (602) 653-5011 Email: @.***

On Aug 5, 2022, at 1:24 PM, Young @.***> wrote:

I have now looked at this closer. I assumed that nextclade would keep the accession number in their reference fasta.

The reference file for 'MPXV' is 'ref_in_coord', so you'll want to set your outgroup to params.iqtree2_outgroup = 'ref_in_coord'

Or you could set it to params.iqtree2_outgroup = '' to avoid the outgroup option altogether.

— Reply to this email directly, view it on GitHub https://github.com/UPHL-BioNGS/Cecret/issues/113#issuecomment-1206834744, or unsubscribe https://github.com/notifications/unsubscribe-auth/AVTVLJT6OBPYXPYSFSW52CTVXV2BNANCNFSM55P3N7EA. You are receiving this because you authored the thread.

-- CONFIDENTIALITY NOTICE:  This e-mail is the property of the Arizona Department of Health Services and contains information that may be PRIVILEGED, CONFIDENTIAL, or otherwise exempt from disclosure by applicable law.  It is intended only for the person(s) to whom it is addressed.  If you have received this communication in error, please do not retain or distribute it.  Please notify the sender immediately by e-mail at the address shown above and delete the original message.  Thank you.  

erinyoung commented 1 year ago

If you leave out the outgroup, it will still show up in the tree.

DrB-S commented 1 year ago

I was able to get Cecret to finish, including iqtree2, by setting the following in configs/cecret.config (this is the working version of dev/monkeypox_steve.config):

  1. //#Removed 5 Aug 22 as per Erin - params.iqtree2_outgroup = 'NC_063383.1' params.iqtree2_outgroup = ''

  2. Commented out the species flag //params.species = 'mpx' //#Turn this on if not using mpx flag

  3. Included the mpx profile on the command-line

(The alternative is to uncomment the params.species line and NOT include the mpx flag on the command-line)

==

There is still erroneous output in the following: iqtree2 nextalign/nextalign.fasta snp_dists dataset/reference.fasta logs/msa:iqtree2/msa:iqtree2..

They all contain "ref_in_coord..*', rather than 'NC_063383.1'

These can be found by running the following command In the Analysis dir:

grep 'ref_in_coord' */*

DrB-S commented 1 year ago

Here are the errors encountered: ref_in_coord_ERRORS.txt

erinyoung commented 1 year ago

This is the final tree. It looks like it should. The ref_in_coord header is from the nextclade group.

iqtree2/iqtree2.iqtree:(ref_in_coord:0.0017018305,((Consensus_ERR3485802.consensus_threshold_0.6_quality_20:0.0000005071,Consensus_ERR3485801.consensus_threshold_0.6_quality_20:0.0000005071):0.0000384397,(((Consensus_ERR3485797.consensus_threshold_0.6_quality_20:0.0000005071,Consensus_ERR3485799.consensus_threshold_0.6_quality_20:0.0000005071):0.0000005071,Consensus_ERR3485800.consensus_threshold_0.6_quality_20:0.0000005071):0.0000005071,Consensus_ERR3485798.consensus_threshold_0.6_quality_20:0.0000005071):0.0000537884):0.0008604210,(MPXV-UK_P3_completeGenome:0.0000005071,(Consensus_SRR19536729.consensus_threshold_0.6_quality_20:0.0000005071,Consensus_SRR19536728.consensus_threshold_0.6_quality_20:0.0000005071):0.0002536239):0.0011492286);

I'll look at this again tomorrow.

erinyoung commented 1 year ago

I thought that nextclade would use 'NC_063383.1' for both hMPXV and MPXV, but it appears that the reference for MPXV is 'ref_in_coord'. There's not a lot that I can do about that.

I've put in an issue with the nextstrain team (https://github.com/nextstrain/nextclade_data/issues/35), but this is going to be something that they're going to have to fix.

DrB-S commented 1 year ago

As a work-around until they fix this at nextstrain, is there a way to stop the Cecret pipeline after it creates dataset/reference.fasta, change the header to '>NC_063383.1', and resume the pipeline?

erinyoung commented 1 year ago

I hesitate to do that because it doesn't look like the sequence is NC_063383.1. It looks like a predicted ancestral sequence with the most recent one being 3 days ago (https://github.com/nextstrain/nextclade_data/tree/master/data/datasets/MPXV/references/ancestral/versions).

DrB-S commented 1 year ago

Does this mean the ancestral sequence will continue changing? And will reference positions change as a result?

erinyoung commented 1 year ago

I wish I could answer that in a definitive manner.

Using the 'hMPXV' seems more stable.

DrB-S commented 1 year ago

And that is NC_063383.1 as per: https://github.com/nextstrain/nextclade_data/tree/989e4c741ee2552bdfd49de661c912965ad33b28/data/datasets/hMPXV/references/NC_063383.1/versions/2022-06-14T12:00:00Z/files

erinyoung commented 1 year ago

Yes

DrB-S commented 1 year ago

To rerun then, I would change the following: params.nextclade_dataset = 'MPXV' ==> 'hMPXV'

And leave the following: params.vadr._refererence = 'mpxv' params.species = 'mpx'

erinyoung commented 1 year ago

I think so?

You could add the params.outgroup back in if you wanted to specify NC_063383.1 as the outgroup when creating a tree with iqtree2.

DrB-S commented 1 year ago

OK

DrB-S commented 1 year ago

Those changes worked!!

erinyoung commented 1 year ago

Victory!

DrB-S commented 1 year ago

Now that it is working, I am doing a couple of test runs keeping params.iqtree2_outgroup = 'NC_063383.1' to see if the reference genome name is still working:

  1. Changing back to params.nextclade_dataset = 'MPXV'
  2. params.nextclade_dataset = 'hMPXV_B1' params.nextclade_dataset_tag = '2022-06-29T12:00:00'
corneliusroemer commented 1 year ago

Sorry I just saw this now. I think I can shed some light on some of your questions (by the way feel free to open issues on Nextclade/Nextclade_data it you aren't sure of something).

Every of the three datasets has a different reference sequence. They all have the same coordinate system (that of NC_068..., the new rivers reference).

Only hMPXV uses a real sequence that is in Genbank.

The other two datasets have that sequence altered by a number of SNPs.

B.1 uses the SNPs of the outbreak polytomy, in this case the MA001 sequence.

MPXV uses the SNPs of a reconstructed ancestor.

None of these reference sequences are changing with new releases. They are stable.

Since we don't really use the ID of the reference sequences for anything I had not chosen it well - sorry for that. I will think of a more expressive and less cumbersome name.

DrB-S commented 1 year ago

Thanks for that information!