How to prepare inputs for lollipop?

skunklem commented 2 years ago

I would like to run your notebook WwSmoothingKernel to analyze some wastewater samples, but I'm curious how you produced the tally file: tallymut_line_full.tsv. I can imagine using mutlist-full.txt as a kind of key for parsing .vcf files, but I wanted to make sure. Also, if you already have some method for creating the tally file, there's no point in me recreating it. Thanks for the help.

DrYak commented 2 years ago

Hi! We're currently heavily editing and improving the analysis pipeline.

We have a proof of concept inside our V-pipe pipeline: signature.smk (for now it is still hard-coded and not yet plugged into our usual schema-based configuration).

Steps are:

- Using a collection of YAML files describing the mutations found on variants (as generated by, and used with, cojac, dev branch), create mutlists.tsv (this is what LolliPop will search variants for). (The script used is not in LolliPop yet; it's temporarily in V-pipe. This corresponds to "rule mutlist" in the proof of concept.)
- For each sample:
  - V-pipe generates a basecount.tsv file from the .BAM file. (It's more or less the equivalent of a pileup, but we use a V-pipe internal format for now: a 5-column coverage-like table, for A, C, T, G and -.) (You can generate it using the script aln2basecount from smallgenomeutilities 0.3.9; this corresponds to the "rule basecount" of the stable version of V-pipe.)
  - For LolliPop, extract from the basecount.tsv.gz the mutations of interest, as listed in the mutlist generated in the first step. As you suggest, this is roughly equivalent to getting a VCF/BCF (plus coverage, or relying on the optional DP field in the VCF), but currently this is done with a per-sample TSV file. (The script is not in LolliPop yet; it's temporarily in V-pipe. This corresponds to the rule "sigmut" in the proof of concept.)
  - We obviously know that relying on bespoke formats for the above two steps isn't very good practice, and we are in the process of writing tools that interface better with more widespread standards (e.g. using samtools + bcftools to get the mutations, then using VCF (+ coverage) as the per-sample intermediate format).
- You need to provide for each sample:
  - "date": date on which sampling was performed.
  - "location": e.g. wastewater treatment plant or catchment area.
  - In our specific case, we extract this information from the names of the samples (we have set up a specific naming convention with our labs), but feel free to use any method as long as your table provides a location and a date. (This is done with a script currently stored in V-pipe, and in the rule timeline of the proof of concept.)
- Finally, the tallymut table is assembled by combining every individual sample's extracted mutation list with the location and date from above. (For now this is simply done using the Rust tool XSV to concatenate and merge the TSVs, in the rule tally mut of the proof of concept.) (A better assembler, using a more standardised per-sample format, e.g. VCF/BCF as you suggested, is in the works.)
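To make the data flow above more concrete, here is a minimal sketch in plain Python of the last two steps: pulling the mutations of interest out of a per-sample base count table, then attaching date and location to build a tally. It is not the V-pipe or LolliPop implementation. The column names (pos, A, C, G, T, -), the mutation list layout, the sample name and the output fields are all assumptions made for illustration; the real basecount.tsv and tallymut files are laid out differently.

```python
import csv
import io

# Hypothetical, simplified base count table for one sample
# (assumed columns; NOT the actual V-pipe basecount.tsv layout).
BASECOUNT_TSV = (
    "pos\tA\tC\tG\tT\t-\n"
    "100\t5\t0\t0\t15\t0\n"
    "200\t0\t8\t2\t0\t0\n"
)

# Hypothetical mutation list: (position, mutated base, variant name).
MUTLIST = [(100, "T", "variantX"), (200, "G", "variantY"), (300, "A", "variantX")]

# Hypothetical timeline: sample name -> (date, location).
TIMELINE = {"sample1": ("2022-01-01", "plant A")}


def extract_mutations(basecount_rows, mutlist):
    """Keep only the positions listed in mutlist, with mutation count and coverage."""
    by_pos = {int(row["pos"]): row for row in basecount_rows}
    found = []
    for pos, base, variant in mutlist:
        row = by_pos.get(pos)
        if row is None:
            continue  # position absent from this sample's table
        coverage = sum(int(row[b]) for b in "ACGT-")
        found.append(
            {"pos": pos, "base": base, "variant": variant,
             "var": int(row[base]), "cov": coverage}
        )
    return found


def build_tally(samples, timeline, mutlist):
    """Combine every sample's extracted mutations with its date and location."""
    tally = []
    for name, rows in samples.items():
        date, location = timeline[name]
        for mut in extract_mutations(rows, mutlist):
            tally.append({"sample": name, "date": date, "location": location, **mut})
    return tally


samples = {
    "sample1": list(csv.DictReader(io.StringIO(BASECOUNT_TSV), delimiter="\t"))
}
tally = build_tally(samples, TIMELINE, MUTLIST)

# Write the tally as TSV text.
out = io.StringIO()
writer = csv.DictWriter(out, fieldnames=list(tally[0]), delimiter="\t", lineterminator="\n")
writer.writeheader()
writer.writerows(tally)
print(out.getvalue())
```

Looking columns up by name rather than by index keeps the sketch independent of the column order in the table.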

DrYak commented 2 years ago

Note:

The script deconvolute.py from LolliPop provides a command-line interface to the same deconvolution as performed by the paper's notebook WwSmoothingKernel (but there are currently a couple of issues that need debugging).
Since the paper submission (and the notebook), the column "city" (in the data subdirectory and used in the notebook) has been renamed to "location" in the above proof of concept.
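If you start from a table that still uses the older "city" header, a small helper along these lines could rename it to "location". This is only a sketch using the standard library; the function name rename_column and the example columns are hypothetical and not part of LolliPop.

```python
import csv
import io


def rename_column(tsv_text, old="city", new="location"):
    """Return tsv_text with the header column `old` renamed to `new`."""
    rows = list(csv.reader(io.StringIO(tsv_text), delimiter="\t"))
    if not rows:
        return tsv_text  # nothing to rename in an empty table
    header = [new if name == old else name for name in rows[0]]
    out = io.StringIO()
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows[1:])
    return out.getvalue()


# Toy table with made-up columns, for illustration only.
old_table = "date\tcity\tpos\n2021-01-01\tZurich\t100\n"
new_table = rename_column(old_table)
print(new_table)
```

Only the header row is changed; the data rows are written back unmodified.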

Further polishing of the proof of concept is going to happen quickly over the coming couple of weeks, so stay in touch. The priorities will be to kill bugs in LolliPop, properly integrate the V-pipe proof of concept and make it configurable (no more hard-coded values), and release updated versions of Cojac (dev into 0.3) and LolliPop (current main into 0.2).

Later, more widespread formats will be integrated (by version 0.3 of LolliPop).

We also plan to make a wrapper for Galaxy to show possible integration into other pipelines (initially using 0.2 + smallgenomeutilities' aln2basecount, eventually standard formats using 0.3).

DrYak commented 2 years ago

Again, sorry for the rather messy state of the wastewater code (compared to our usual work on V-pipe): there was a lot of rush to prepare the Recomb paper before the deadline, and to finish the proof of concept during last week's biohackathon.

skunklem commented 2 years ago

Thanks for such a quick and detailed response. I understand the occasional need for speed over clean code. I'll see what I can do with all the information you shared and look forward to the finished product.

DrYak commented 1 year ago

Update on the situation:

LolliPop 0.3 will be released in the coming days on Bioconda.
README.md now clearly explains how to generate the necessary inputs and the steps.
- Currently, the steps using smallgenomeutilities' aln2basecnt are shown (exactly as V-pipe does it internally).
- VCF + coverage as an alternative input is postponed to a later version.

cbg-ethz / LolliPop

How to prepare inputs for lollipop? #1