andersen-lab / Freyja

Depth-weighted De-Mixing
BSD 2-Clause "Simplified" License
100 stars 29 forks source link

Missing "recombinant parents" #161

Closed wutron closed 12 months ago

wutron commented 1 year ago

Hi Josh and Dylan,

Thanks for all the updates to Freyja! I have been exploring some of the new features in your GitHub repo. However, when I use --depthcutoff from #145, I get the following error:

Traceback (most recent call last):
  File "freyja-dev/bin/freyja", line 33, in <module>
    sys.exit(load_entry_point('freyja', 'console_scripts', 'freyja')())
  File "freyja-dev/lib/python3.10/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
  File "freyja-dev/lib/python3.10/site-packages/click/core.py", line 1078, in main
    rv = self.invoke(ctx)
  File "freyja-dev/lib/python3.10/site-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "freyja-dev/lib/python3.10/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "freyja-dev/lib/python3.10/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "freyja/_cli.py", line 87, in demix
    df_barcodes = collapse_barcodes(df_barcodes, df_depth, depthcutoff,
  File "freyja/utils.py", line 880, in collapse_barcodes
    parents = lineage_data[alias]['recombinant_parents']\
KeyError: 'recombinant_parents'

I dug into utils.py, and it seems like the code assumes that lineages that start with "X" have recombinant parents: https://github.com/andersen-lab/Freyja/blob/dfdd47f3656768c70c774edd219193164c096e42/freyja/utils.py#L878-L881

However, not all lineages that start with "X" have recombinant parents unless you go up the hierarchy. For example, XBJ.1 does not have recombinant parents in the yml unless you first trace up to its parent XBJ that does have recombinant parents: https://github.com/andersen-lab/Freyja/blob/dfdd47f3656768c70c774edd219193164c096e42/freyja/data/lineages.yml#L30893-L30908

I admit I have not thought much about this grouping by MRCA, so I'm not sure if the "right" solution is to trace up to parent aliases.

I also want to bring up one issue when using the MRCA. Consider the following tree, where a, b, c, and d are lineages:

     /-- a  
  /--|
  |  \-- b
--|
  |  /-- c
  \--|
     \-- d

Correct me if I am wrong, but if there are convergent mutations, I think it is possible for a and c to have barcodes that group together and b and d to have barcodes that group together (once you remove sites based on --depthcutoff). Grouping by MRCA would cause everything to group into a MRCA = root even though a and c are distinguishable from b and d.

As a suggested alternative, in my workflow, if I have lineages that are indistinguishable by barcode, instead of tracing up to an MRCA, I designate one as the "canonical" lineage name, and I have a separate file that shows all lineages that map to each canonical lineage.

Thanks, Jessica

joshuailevy commented 1 year ago

Hi Jessica/@wutron!

Thanks for bringing this up (and for all of your contributions as of late)! The --depthcutoff option is still in somewhat of a beta testing phase, but you're right- this will not work properly for recombinant lineage descendants. Instead of checking for "X" lineages I believe the correct thing to is to check if the recombinant_parents property exists for that lineage. Should be a relatively quick fix @dylanpilz?

Regarding your question about the collapsing to the root: In the scenario you specified in which b and d are no longer distinguishable but their siblings a and c are, you'd have b and d grouped into a lineage that would be assigned to the root, but a and c would not get merged into this same group. They would remain separate (if they are not distinguishable from one another they should get merged in a different group- also with the root name, but with an additional number to show that it's not the same as the b/d group), and these lineage groupings are noted in the yml file that also gets returned.

The use of one as the "canonical" lineage name is an interesting idea, but I worry that it can be a bit misleading- what we might call EG.4 could easily be grouped into with a ton of other XBB lineages in a low coverage sample, even if we don't have any evidence to separate it from other XBB lineages. Or is there a non-random way that your "canonical" lineage is chosen?

Josh

wutron commented 1 year ago

Point 1: Sounds good on the quick fix.

Point 2: I see where it introduces a suffix number to the mrca name now. That resolves my concern then!

P.S. The "canonical" name is just a randomly chosen name, and I rely on the mapping to determine whether a lineage name maps to multiple lineages. It sounds like we have the same process, e.g. find what to collapse and show what collapses in a separate file. But I can see how renaming the lineages has less potential for confusing users.

Thanks!

dylanpilz commented 1 year ago

Hey Jessica,

I'm finishing up the fix and was wondering if you could share one or more offending samples along with the command you ran that led to the error?

Dylan

wutron commented 1 year ago

Can I share via email instead?

joshuailevy commented 1 year ago

For sure! Feel free to send it to jolevy@scripps.edu, and I'll share it with Dylan.