UPHL-BioNGS / Cecret

Reference-based consensus creation
MIT License

Producing snps and trees from Covid reads input #286

Closed. DrB-S closed this issue 2 months ago.

DrB-S commented 5 months ago

I ran the newest version of Cecret on 185 pairs of Covid read files, and it ran super fast! I expected to see SNPs and trees, but they weren't produced. Here is the command line:

nextflow run UPHL-BioNGS/Cecret -profile singularity -c configs/sarscov2_wastewater.config --relatedness true --msa nextalign --freyja_demix_options '--depthcutoff 10' --freyja_boot_options '--nb 1000'

I see that both msa.nf and sarscov2.nf use ch_fasta as input. Do I need to rerun using the consensus fastas as input instead, or is there a way to do this directly from the reads?

DrB-S commented 5 months ago

Oops! I see the problem. It should be --msa nextclade!

DrB-S commented 5 months ago

I ran with --msa nextclade, but snps and trees are still not being produced.

erinyoung commented 5 months ago

PhyTreeViz can't calculate a tree with the number of nodes in the nextclade newick file, unfortunately. You should still be able to view this tree with some other tool, but it does have all of the nextclade data nodes included. There's also an auspice file (json) in the nextclade directory that you can use to upload to https://auspice-us.herokuapp.com/ to view.

The SNP matrix should still be created, though. How many samples are you running?

DrB-S commented 5 months ago

185

erinyoung commented 5 months ago

That shouldn't be too many. What files are in your nextclade directory?

DrB-S commented 5 months ago

It didn't produce a nextclade directory. I noticed that I had included NC_063383.1.fasta and NC_063383.1.gff (Monkeypox) in the fastas and gff dir, which must have thrown it all off. I have removed them and am running anew:

nextflow run UPHL-BioNGS/Cecret -profile singularity,wastewater --relatedness true --msa nextclade --freyja_demix_options '--depthcutoff 10'

Do I need to specify the outgroup (MN908947.3) on the command line, or will it be pulled automatically from the genomes dir?

erinyoung commented 5 months ago

These are the config settings for wastewater: https://github.com/UPHL-BioNGS/Cecret/blob/master/configs/sarscov2_wastewater.config

Nextclade is turned off in the wastewater config because a clade call on the consensus sequence of a mixed sample doesn't mean anything.

What kind of samples are you running?

DrB-S commented 5 months ago

Wastewater reads.

DrB-S commented 5 months ago

The pipeline failed at the end, but it did produce a snp matrix:

executor > local (3464)
[- ] process > CECRET:fasta_prep -
[40/5652c9] process > CECRET:cecret:seqyclean (2023WW0348) [100%] 185 of 185 ✔
[3b/d1e14e] process > CECRET:cecret:bwa (2023WW0348) [100%] 185 of 185 ✔
[c1/6645ae] process > CECRET:cecret:sort (2023WW0348) [100%] 185 of 185 ✔
[96/d6bc91] process > CECRET:cecret:ivar_trim (2023WW0373) [100%] 181 of 181 ✔
[6a/eafdd9] process > CECRET:cecret:ivar (2023WW0373) [100%] 181 of 181 ✔
[- ] process > CECRET:cecret:artic_read_filtering -
[- ] process > CECRET:cecret:artic -
[06/61445f] process > CECRET:qc:fastqc (2023WW0384) [100%] 185 of 185 ✔
[- ] process > CECRET:qc:kraken2 -
[4a/fd0372] process > CECRET:qc:samtools_intial_stats (2023WW0348) [100%] 185 of 185 ✔
[86/8dba39] process > CECRET:qc:aci (2023WW0373) [100%] 174 of 174
[f7/047586] process > CECRET:qc:samtools_flagstat (2023WW0373) [100%] 181 of 181 ✔
[78/96e523] process > CECRET:qc:samtools_depth (2023WW0373) [100%] 181 of 181 ✔
[5e/dec0d2] process > CECRET:qc:samtools_coverage (2023WW0373) [100%] 181 of 181 ✔
[49/c0bbce] process > CECRET:qc:samtools_stats (2023WW0373) [100%] 181 of 181 ✔
[f5/974fa9] process > CECRET:qc:bcftools_variants (2023WW0373) [100%] 181 of 181 ✔
[85/fdb91e] process > CECRET:qc:ivar_variants (2023WW0373) [100%] 181 of 181 ✔
[94/898a61] process > CECRET:qc:samtools_ampliconstats (2023WW0373) [100%] 181 of 181 ✔
[c0/f6b706] process > CECRET:qc:samtools_plot_ampliconstats (2023WW0373) [100%] 181 of 181 ✔
[0a/8f7a8e] process > CECRET:qc:igv_reports (2023WW0373) [100%] 174 of 174
[- ] process > CECRET:sarscov2:vadr -
[- ] process > CECRET:sarscov2:pangolin -
[- ] process > CECRET:sarscov2:pango_collapse -
[bb/c37dc5] process > CECRET:sarscov2:dataset (Downloading NextClade Dataset) [100%] 1 of 1 ✔
[84/4cbe5e] process > CECRET:sarscov2:nextclade (Clade Determination) [100%] 1 of 1 ✔
[1e/0b9dd3] process > CECRET:sarscov2:freyja_variants (2023WW0373) [100%] 181 of 181 ✔
[a3/0cfafe] process > CECRET:sarscov2:freyja_demix (2023WW0373) [100%] 178 of 178
[- ] process > CECRET:sarscov2:freyja_aggregate -
[96/d1913b] process > CECRET:msa:phytreeviz (Tree visualization) [100%] 2 of 2, failed: 2, retries: 1 ✔
[b5/1601a5] process > CECRET:msa:snpdists (creating snp matrix with snp-dists) [100%] 1 of 1 ✔
[- ] process > CECRET:msa:heatcluster -
[- ] process > CECRET:multiqc_combine -
[- ] process > CECRET:summary -
Pulling Singularity image docker://staphb/pangolin:4.3.1-pdata-1.24 [cache /data/nextflow_cachedir/staphb-pangolin-4.3.1-pdata-1.24.img]
Pulling Singularity image docker://staphb/vadr:1.6.3 [cache /data/nextflow_cachedir/staphb-vadr-1.6.3.img]
Pulling Singularity image docker://quay.io/uphl/heatcluster:1.0.2c-2024-01-09 [cache /data/nextflow_cachedir/quay.io-uphl-heatcluster-1.0.2c-2024-01-09.img]
[22/1f025d] NOTE: Missing output file(s) phytreeviz/tree.png expected by process CECRET:msa:phytreeviz (Tree visualization) -- Execution is retried (1)
[96/d1913b] NOTE: Missing output file(s) phytreeviz/tree.png expected by process CECRET:msa:phytreeviz (Tree visualization) -- Error is ignored
ERROR ~ Error executing process > 'CECRET:sarscov2:pangolin'

Caused by:
Failed to pull singularity image
command: singularity pull --name staphb-pangolin-4.3.1-pdata-1.24.img.pulling.1706563783424 docker://staphb/pangolin:4.3.1-pdata-1.24 > /dev/null
status : 255
message:
INFO: Converting OCI blobs to SIF format
INFO: Starting build...
Getting image source signatures
Copying blob sha256:578acb154839e9d0034432e8f53756d6f53ba62cf8c7ea5218a2476bf5b58fc9
Copying blob sha256:644cc1f212f602ba382c1b343b65039eca8478ec9997a8a6d93bfffe90d24ad7
Copying blob sha256:a2d7dcebe2368f2ea4f5b52f2af1a55f234cf13fb2eff7332e39ba9e463f9af2
Copying blob sha256:4f4fb700ef54461cfa02571ae0db9a0dc1e0cdb5577484a6d75e68dc38e8acc1
Copying blob sha256:bb02f4fb31804257c11083dc6f3756d02bdf5c700a697707e6df24aecf18ba2b
Copying blob sha256:a99d59b7d1ec90f43e2cb49e42a0ad44954bd01b699896abd73c6ee77ad943f7
Copying blob sha256:2ad9805fbbd597b3d4f40918d29216bab5b7a09c1ce353c4fd4c0251d1974a6c
Copying blob sha256:4f4fb700ef54461cfa02571ae0db9a0dc1e0cdb5577484a6d75e68dc38e8acc1
Copying blob sha256:af33fed889b0eee32b75996a3baa3621592d0cc017b7e08be2676d866ea0f257
Copying blob sha256:db17c7428ea55d6ed2d2c3a33dfdf90f2bc08ab06549c4b65a9790d2e7fac25e
Copying blob sha256:bf0189e4667ce6526f4a593ff5c7dbb3637a4586cad77c90bb6ce6c11a231d18
Copying blob sha256:07e2ea468e00466d83f52be0c5e9b48c0d381019c038877e338ed903c237dc82
Copying blob sha256:5b602560f5480f5440f27b362a15534aef7980e8e12f763c4d61b4426f3f7844
Copying blob sha256:399caa5c4226e9971a64bc3094432f85c7b8a8b215c5f317909ae894f920a0bf
Copying blob sha256:d552d70732e78bbc1076ed7f32dc703070bc1d651cd472462c56162841e4ede1
Copying blob sha256:4f4fb700ef54461cfa02571ae0db9a0dc1e0cdb5577484a6d75e68dc38e8acc1
Copying blob sha256:cbace4e6ede6e0b98a21fbd416170136ee448a6c7d4fa404f5ec0f34b5fca1ff
Copying blob sha256:4f4fb700ef54461cfa02571ae0db9a0dc1e0cdb5577484a6d75e68dc38e8acc1
Copying config sha256:df652314e69365ed5734e4c9b45d629e8c0afe41210ed2a56a6e81441eac62af
Writing manifest to image destination
Storing signatures
FATAL: While making image from oci registry: error fetching image to cache: while building SIF from layers: conveyor failed to get: no descriptor found for reference "caa2b887caa62ca47ebb0a3e15b5d2f5b823162b1f998120ab89addf94ca363e"

-- Check '.nextflow.log' file for details

erinyoung commented 5 months ago

It looks like the pangolin image failed to download in that error. That happens with the larger singularity images. If you rerun with -resume, it should try the pull again. You can also download the image manually and move it to your singularity cache directory:

singularity pull --name staphb-pangolin-4.3.1-pdata-1.24.img docker://staphb/pangolin:4.3.1-pdata-1.24
mv staphb-pangolin-4.3.1-pdata-1.24.img <directory where you keep your singularity images>/.
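
If the pull keeps failing intermittently, a retry loop is one way to automate the manual step. This is a minimal sketch and not part of Cecret: the `retry` helper, the attempt count, and the delay are hypothetical, and only the `singularity pull` command in the trailing comment comes from the lines above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: run a command up to N times, pausing between attempts.
# Usage: retry <attempts> <delay_seconds> <command> [args...]
retry() {
    local attempts=$1 delay=$2
    shift 2
    local i
    for ((i = 1; i <= attempts; i++)); do
        # Stop as soon as the command succeeds.
        if "$@"; then
            return 0
        fi
        echo "attempt ${i}/${attempts} failed: $*" >&2
        # Only wait if another attempt is coming.
        if ((i < attempts)); then
            sleep "$delay"
        fi
    done
    return 1
}

# Example (not executed here): try the pull up to 3 times, 30 seconds apart.
# retry 3 30 singularity pull --name staphb-pangolin-4.3.1-pdata-1.24.img \
#     docker://staphb/pangolin:4.3.1-pdata-1.24
```

The helper returns non-zero only after every attempt has failed, so it can sit in front of the `mv` step without hiding a persistent failure.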

DrB-S commented 5 months ago

Will do. Thanks!

erinyoung commented 5 months ago

I don't think the results for nextclade or pangolin are useful with wastewater samples... or really anything that has to do with the consensus fasta. I'm curious: what are you using this information for?

DrB-S commented 5 months ago

I need lineages, abundance, and reference coverage, which I already have. I thought I would also produce a snp matrix and tree.

DrB-S commented 4 months ago

I have run the newest version of Cecret on reads, but neither iqtree2 nor phytreeviz is producing a directory. I created a new config file (contents below) to shorten the command line:

params.species = 'sarscov2'
params.nextclade_dataset = 'sars-cov-2'
params.vadr_options = '--split --glsearch -s -r --nomisc --lowsim5seq 6 --lowsim3seq 6 --alt_fail lowscore,insertnn,deletinn'
params.vadr_reference = 'sarscov2'
params.vadr_trim_options = '--minlen 50 --maxlen 30000'
params.iqtree2 = 'true'
params.iqtree2_outgroup = 'MN908947.3'
params.relatedness = 'true'
params.msa = 'nextclade'
params.freyja_demix_options = '--depthcutoff 10'
params.freyja_boot_options = '--nb 1000'

erinyoung commented 4 months ago

iqtree is only run after mafft. I should allow iqtree to run on the nextclade msa.

I'm guessing the newest version of phytreeviz still isn't liking the nextclade newick file.

DrB-S commented 4 months ago

The nextclade newick file is being produced in the work dir. Here is the error file (maybe the newick file is too big):

Matplotlib created a temporary cache directory at /tmp/matplotlib-3tgxgmbj because the default path (/app/becksts/.config/matplotlib) is not a writable directory; it is highly recommended to set the MPLCONFIGDIR environment variable to a writable directory, in particular to speed up the import of Matplotlib and to better support multiprocessing.
Matplotlib created a temporary cache directory at /tmp/matplotlib-w5yrbqnq because the default path (/app/becksts/.config/matplotlib) is not a writable directory; it is highly recommended to set the MPLCONFIGDIR environment variable to a writable directory, in particular to speed up the import of Matplotlib and to better support multiprocessing.
Traceback (most recent call last):
  File "/usr/local/bin/phytreeviz", line 8, in <module>
    sys.exit(cli.main())
  File "/usr/local/lib/python3.9/site-packages/phytreeviz/scripts/cli.py", line 13, in main
    run(**args.__dict__)
  File "/usr/local/lib/python3.9/site-packages/phytreeviz/scripts/cli.py", line 44, in run
    tp.savefig(outfile, dpi=dpi)
  File "/usr/local/lib/python3.9/site-packages/phytreeviz/treeviz.py", line 892, in savefig
    fig.savefig(
  File "/usr/local/lib/python3.9/site-packages/matplotlib/figure.py", line 3390, in savefig
    self.canvas.print_figure(fname, **kwargs)
  File "/usr/local/lib/python3.9/site-packages/matplotlib/backend_bases.py", line 2156, in print_figure
    renderer = _get_renderer(
  File "/usr/local/lib/python3.9/site-packages/matplotlib/backend_bases.py", line 1642, in _get_renderer
    print_method(io.BytesIO())
  File "/usr/local/lib/python3.9/site-packages/matplotlib/backend_bases.py", line 2043, in <lambda>
    print_method = functools.wraps(meth)(lambda *args, **kwargs: meth(
  File "/usr/local/lib/python3.9/site-packages/matplotlib/backends/backend_agg.py", line 497, in print_png
    self._print_pil(filename_or_obj, "png", pil_kwargs, metadata)
  File "/usr/local/lib/python3.9/site-packages/matplotlib/backends/backend_agg.py", line 445, in _print_pil
    FigureCanvasAgg.draw(self)
  File "/usr/local/lib/python3.9/site-packages/matplotlib/backends/backend_agg.py", line 383, in draw
    self.renderer = self.get_renderer()
  File "/usr/local/lib/python3.9/site-packages/matplotlib/backends/backend_agg.py", line 398, in get_renderer
    self.renderer = RendererAgg(w, h, self.figure.dpi)
  File "/usr/local/lib/python3.9/site-packages/matplotlib/backends/backend_agg.py", line 70, in __init__
    self._renderer = _RendererAgg(int(width), int(height), dpi)
ValueError: Image size of 2400x454650 pixels is too large. It must be less than 2^16 in each direction.

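The numbers in that ValueError can be checked by hand. The sketch below only restates the arithmetic from the message (each dimension must be under 2^16 = 65536 pixels); the helper names `fits_agg` and `min_shrink_factor` are made up for illustration and are not part of phytreeviz or Matplotlib.

```shell
#!/usr/bin/env bash
# Limit quoted in the error message: each dimension must be < 2^16 pixels.
max_px=$((1 << 16))   # 65536

# Succeeds (exit status 0) only if both dimensions are under the limit.
fits_agg() {
    local width_px=$1 height_px=$2
    (( width_px < max_px && height_px < max_px ))
}

# Smallest whole-number factor by which a dimension must shrink to fit,
# computed as ceil(pixels / (max_px - 1)) with integer arithmetic.
min_shrink_factor() {
    local px=$1
    echo $(( (px + max_px - 2) / (max_px - 1) ))
}

# Dimensions taken from the ValueError above.
if fits_agg 2400 454650; then
    echo "fits"
else
    echo "too large: height must shrink about $(min_shrink_factor 454650)x"
fi
```

For the 2400x454650 image in the traceback this reports a factor of 7 (454650 / 7 = 64950, which is under 65536), so the width is fine and only the height, which presumably grows with the number of tips drawn, is the problem.
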
erinyoung commented 4 months ago

Yeah, phytreeviz is still having issues with how many nodes are in this tree. Have you tried looking at nextclade's newick file in itol or some other software?
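
Before loading the tree into another viewer, it can help to see how many tips it actually has. This is a rough sketch, assuming the Newick file has no quoted labels or comments that contain commas (in that case the tip count is the number of commas plus one); `count_tips` and the file name `tree.nwk` are placeholders, not names used by Cecret or nextclade.

```shell
#!/usr/bin/env bash
# Rough tip count for a Newick tree: every internal node with k children
# contributes k - 1 commas, so tips = commas + 1.
# Assumes no commas inside quoted labels or [comments].
count_tips() {
    local newick_file=$1 commas
    # Keep only the commas, then count the remaining bytes.
    commas=$(tr -cd ',' < "$newick_file" | wc -c)
    echo $((commas + 1))
}

# Example (placeholder path, not executed here):
# count_tips tree.nwk
```

A count far above the 185 samples in the run would be consistent with the tree carrying the nextclade reference nodes as well as the samples.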

erinyoung commented 3 months ago

I tried running the latest version of nextalign (as opposed to nextclade) today, but it gives the same result. I'm glad the multiple sequence alignment file is generated as expected, but the newick tree has too many nodes.

Would it be helpful to you if nextclade's multiple sequence alignment was fed into iqtree2 for phylogenetic tree creation?

DrB-S commented 3 months ago

That could be useful.

erinyoung commented 3 months ago

The latest version of Cecret (https://github.com/UPHL-BioNGS/Cecret/releases/tag/3.13.20240319) will use the multiple sequence alignment file from nextclade in the iqtree2 process to create a tree.