andersen-lab / Freyja

Depth-weighted De-Mixing
BSD 2-Clause "Simplified" License
102 stars 29 forks source link

-like lineages grouped within "Other" VOC #213

Closed ktmeaton closed 6 months ago

ktmeaton commented 7 months ago

Hi Freyja Team,

Is it intentional that when using the --depthcutoff parameter, the -like lineages are all grouped within the Other summarized? For example, if I disable the cutoff (--depthcutoff 0), then I can get an ambiguous mix of BA.1.1 and BA.1.4 for a total of 99.9% Omicron. But if I enable the cutoff (--depthcutoff 1), regardless of value, I get BA.1-like that is 99.9% Other.

Ideally, I think I'd want to see the -like lineages counted towards their known VOC (ex. Omicron). But is it intentional that they're grouped under Other instead?

depthcutoff summarized lineages abundances resid coverage
SRX19564628_depthcutoff_0 [('Omicron', 0.9997535724749947)] BA.1.1 BA.1.4 0.49987679 0.49987679 1.4054486336076677 86.13517038424239
SRX19564628_depthcutoff_1 [('Other', 0.9997535723720241)] BA.1-like 0.99975357 1.4054486340159833 86.13517038424239

Commands

Place sample data in data/.

Download the freyja database:

mkdir database
wget -O database/lineages.yml https://raw.githubusercontent.com/andersen-lab/Freyja-data/main/history_lineage_hierarchy/lineages02_16_2024-00-48.yml
wget -O database/usher_barcodes.csv https://raw.githubusercontent.com/andersen-lab/Freyja-data/main/history_barcodes/usher_barcodes02_16_2024-00-48.csv
wget -O database/curated_lineages.json https://raw.githubusercontent.com/andersen-lab/Freyja-data/main/history_metadata/curated_lineages02_16_2024-00-48.json

Run freyja demix:

mkdir results

docker run \
  -v $(pwd)/data:/data \
  -v $(pwd)/database:/database \
  -v $(pwd)/results:/results \
  quay.io/biocontainers/freyja:1.4.9--pyhdfd78af_0 \
  freyja demix \
    --confirmedonly \
    --barcodes /database/usher_barcodes.csv \
    --meta /database/curated_lineages.json \
    --lineageyml /database/lineages.yml \
    --depthcutoff 1 \
    --output /results/SRX19564628_depthcutoff_1.demix \
    /data/SRX19564628.variants.txt \
    /data/SRX19564628.depth.txt
ktmeaton commented 7 months ago

On a related note, with the default --depthcutoff 0, is the lineage abundace getting evenly split between all possible lineage candidates?

BA.1-like= 99.9% -> BA.1.1 = 49.9% + BA.1.4 = 49.9% ?

In this particular sample, there is 0 coverage at the lineage disciminating sites: G22599A and G23351A. Which I'm interpreting as this sample could be either BA.1.1 or BA.1.4. Does splitting ir risk underestimating the abundance?

(Sorry if these questions are already documented somewhere else!)

joshuailevy commented 7 months ago

Hey @ktmeaton!

Thanks for bringing this up. At present the summarized output doesn't properly use the collapsed groupings generated via use of the --depthcutoff parameter, but this is something we're hoping to incorporate soon.

Relatedly- the groupings present issues for plotting, as collapsed groups can differ across samples. We're planning to leverage the group MRCAs to make this work relatively soon.

And yes, if Freyja is unable to differentiate between lineages given the available coverage, it will return equal fractions of the possible lineages back to you. The correct thing to do in this case IMO is to instead collapse into a BA.1-like lineage, that includes BA.1.1 and BA.1.4.

Josh

ktmeaton commented 7 months ago

Thanks so much for the response! I was thinking that collapsing to BA.1-like was the preferred approach, but just wanted to check.

Regarding MRCAs, do you have any wisdom/experience to pass along about getting MRCAs between recombinants? Especially for those crazy nested ones like XDM (XDA + GW.5)? I've attempted to write some functions that follow all possible ancestors, but I'm not sure if that's the right approach, and it's dizzying to test. And then breaking ties on top of that gets complicated (ex. XE and XG would have multiple MRCAs? BA.1 and BA.2?). Just curious what you've tried for freyja!

joshuailevy commented 7 months ago

No prob! Yeah, it gets pretty tricky when dealing when we get into recombinants. For the nested ones, generally you can break things down the nested ones by pulling the parents of recombinant, and then looking for overall MRCA. Of course if those parents are pretty distinct (like a Delta/Omicron recombinant, the MRCA heads towards the root pretty fast).

Since we're still in the development phase of that functionality - right now a lot of these cases with recombinants will result in a "Misc" group name, but hopefully more to follow up on that soon!