andersen-lab / Freyja

Depth-weighted De-Mixing
BSD 2-Clause "Simplified" License

[BUG] Unexpected Behavior Depending on Reference #233

Closed whottel closed 5 months ago

whottel commented 5 months ago

Hello,

I am running a singularity image of Freyja (freyja:1.5.0-04_23_2024-00-44-2024-04-23) as maintained here: https://hub.docker.com/r/staphb/freyja/tags.

I noticed that the Freyja lineage results can change depending on the format of the MN908947.3 reference file given to the freyja variants command. Specifically, I used one version with AUCG formatting (MN908947.3_AUCG.txt) and another with ATCG formatting (MN908947.3_ATCG.txt); the .fasta files were converted to .txt so I could attach them here.

The difference in reference impacts the variants output file. 2415979_AUCG_variants.csv 2415979_ATCG_variants.csv

Noticeably, the variant file generated from the AUCG reference is much larger, as a row is generated for every REF U -> ALT T. The depth files appear to be identical: 2415979_AUCG_depths.txt 2415979_ATCG_depths.txt

And ultimately when running freyja demix the lineage results are different. 2415979_AUCG_demix_out.txt 2415979_ATCG_demix_out.txt

Why would extraneous U -> T variant calls result in different lineage calls, assuming that is the issue?

joshuailevy commented 5 months ago

Hey @whottel!

The actual sequencing reads are generally going to be in terms of ATGC (even for an RNA virus), so your variants output is going to suggest that you have all sorts of U->T mutations that aren't real. Your sample itself uses Ts rather than Us, so the variants step gets confused.

Additionally, under the hood, Freyja encodes all of the mutations as [Ref base] [site number] [Alt base], so if you have a different ref base (by switching Ts to Us), the mutations with T as a reference base will be estimated to have zero abundance in the sample.
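To illustrate the point about mutation encoding, here is a minimal Python sketch. It is not Freyja's actual code: the `mutation_label` helper, the `barcode` set, and the example calls are all hypothetical, and only the [Ref base][site number][Alt base] labelling scheme comes from the description above.

```python
# Hypothetical illustration, not Freyja's implementation.
def mutation_label(ref_base, site, alt_base):
    """Build a label in the [Ref base][site number][Alt base] form."""
    return f"{ref_base}{site}{alt_base}"

# Hypothetical lineage barcode: mutation labels written with T, never U.
barcode = {"C241T", "T670G", "C3037T"}

def observed_labels(calls):
    """Turn (ref, site, alt) variant calls into a set of labels."""
    return {mutation_label(ref, site, alt) for ref, site, alt in calls}

# The same sample called against an ATCG reference...
calls_atcg = [("C", 241, "T"), ("T", 670, "G")]
# ...and against an AUCG reference: T-reference sites now read "U",
# and every U site matched by a T read appears as a spurious U->T call.
calls_aucg = [("C", 241, "T"), ("U", 670, "G"), ("U", 100, "T")]

matched_atcg = barcode & observed_labels(calls_atcg)
matched_aucg = barcode & observed_labels(calls_aucg)

print(sorted(matched_atcg))  # ['C241T', 'T670G']
print(sorted(matched_aucg))  # ['C241T']
```

With the AUCG reference, the call at site 670 is labelled U670G, so it no longer matches the barcode entry T670G and that mutation looks absent from the sample.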

It wouldn't be hard to do the associated flipping of Ts to Us in your sample, and then treat Ts as Us in the freyja barcodes, but that seems like a bunch of unnecessary processing. Better to use a reference that reflects the sequencing data and the common mutation nomenclature to minimize those sorts of steps.
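If the only reference on hand is in AUCG form, one way to get a reference that matches the reads is to rewrite it with Ts before running the variants step. The following is a rough sketch under that assumption; the `rna_to_dna_fasta` function and the file names in the usage comment are hypothetical, and this is not part of Freyja.

```python
# Hypothetical helper, not part of Freyja.
def rna_to_dna_fasta(text):
    """Replace U/u with T/t on sequence lines, leaving '>' header lines as-is."""
    converted = []
    for line in text.splitlines():
        if line.startswith(">"):
            # Header lines may legitimately contain the letter U; keep them.
            converted.append(line)
        else:
            converted.append(line.replace("U", "T").replace("u", "t"))
    return "\n".join(converted) + "\n"

# Example usage with hypothetical file names:
# with open("reference_AUCG.fasta") as src:
#     dna_text = rna_to_dna_fasta(src.read())
# with open("reference_ATCG.fasta", "w") as dst:
#     dst.write(dna_text)
```

Converting the reference once keeps the sample and the barcodes untouched, which is in line with the advice above to avoid extra processing steps.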

Josh

whottel commented 5 months ago

Thanks, I figured I should be using the ATCG reference but wanted to double check. I will make sure that the reference files we are using on our end are consistent with this.