ComparativeGenomicsToolkit / Comparative-Annotation-Toolkit

Apache License 2.0

Orthologs in denovo HGM CGPs #93

Open fbemm opened 6 years ago

fbemm commented 6 years ago

If I understand HGM (especially Augustus CGP HGM) correctly, it returns "orthologs" even for CGPs. So the reference might be missing, but the rest of the input genomes might share the annotated gene. From what I see in CAT's output files, this information is still there. Would it make sense to classify them as novel_orthologs if they are found in at least 2 genomes? They would need a common gene name then. It would be neat if one could construct a multi-genome transmap directly from the GFF files in the end.

ifiddes commented 6 years ago

That is possible. There is actually a flag in HGM called --printHomologs that attempts to do that directly. The reason I don't use this flag is that it is computationally expensive and requires that Augustus be built with the boost library, which is a pain. But it could be an option.

I am not sure I understand what you mean in regards to the HGM output that CAT makes currently. There is no guarantee in CGP that gene X in genome A is the same identifier as gene X in genome B. The --printHomologs output does attempt to make this connection.

lassancejm commented 6 years ago

This sounds like a useful option (may not have to do homolog searching subsequently)

fbemm commented 6 years ago

Exactly what I meant. Right now I can't be sure that gene X in genome A is the same identifier as gene X in genome B. --printHomologs is exactly what I was searching for! These novel_orthologs are "biologically" pretty cool ;) One could even think of a plot of how the ortholog groups/clusters are sized in the end, to see some kind of saturation pattern.
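
A minimal sketch of the group-size summary suggested above. It assumes the homolog groups have already been parsed into lists of `(genome, transcript_id)` pairs; the function name and the example data are hypothetical.

```python
from collections import Counter


def group_size_distribution(groups):
    """Return {number_of_genomes: number_of_groups} for a list of homolog groups.

    Each group is assumed to be a list of (genome, transcript_id) pairs.
    """
    sizes = Counter()
    for group in groups:
        # Count distinct genomes so that paralogs do not inflate the size.
        sizes[len({genome for genome, _ in group})] += 1
    return dict(sorted(sizes.items()))


# Made-up example data, for illustration only.
example_groups = [
    [("dog", "jg7"), ("human", "jg9"), ("mouse", "jg11")],
    [("dog", "jg8"), ("human", "jg10")],
    [("human", "jg3"), ("mouse", "jg4")],
]
distribution = group_size_distribution(example_groups)
print(distribution)
```

The resulting dictionary can be passed to any plotting library as a bar chart of group counts per number of genomes.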

ifiddes commented 6 years ago

Alright, I will work on setting up a parser for the --printHomologs output and integrating it into the pipeline. Need to conceptualize how that will work first.

In the meantime, if you are in a rush, you can run that part directly by just passing all of the annotations as GTF to homGeneMapping with that flag activated.
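
A rough sketch of how such a standalone run could be set up from Python. The option names `--gtfs`, `--halfile` and `--printHomologs` are taken from the homGeneMapping usage text and should be checked against `homGeneMapping --help` for the installed Augustus build; the helper names, genome names and file paths are placeholders.

```python
def gtf_table_text(gtfs):
    """Build the genome-to-GTF table: one 'genome<TAB>path' line per genome."""
    return "".join(f"{genome}\t{path}\n" for genome, path in sorted(gtfs.items()))


def hgm_command(table_path, hal_file, homologs_out):
    """Assemble the homGeneMapping argument list (assumed option names, see above)."""
    return [
        "homGeneMapping",
        f"--gtfs={table_path}",
        f"--halfile={hal_file}",
        f"--printHomologs={homologs_out}",  # needs Augustus built with Boost
    ]


# Placeholder inputs; replace with the real annotation and alignment files.
table = gtf_table_text({"human": "human.augCGP.gtf", "mouse": "mouse.augCGP.gtf"})
command = hgm_command("gtfs.tbl", "alignment.hal", "homologs.txt")
print(table, end="")
print(" ".join(command))
```

After writing `table` to `gtfs.tbl`, the list in `command` can be handed to `subprocess.run(command, check=True)`.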

fbemm commented 6 years ago

Doing exactly that atm. I will report back if I run into issues.

lassancejm commented 6 years ago

Hey @fbemm, do you mind sharing how you made use of --printHomologs in the context of CAT? @ifiddes: any update on your end? I am thinking about re-running the pipeline for the purpose of this thread, and I don't want to reinvent the wheel if something is almost ready.

Thanks !

ifiddes commented 6 years ago

Sorry for the delay, I am behind now that I am working at 10x. I promise to make these changes, but it may take a bit. In the meantime, I recommend trying out --printHomologs. The output is a bit weird to parse, so I may be able to provide a parser sooner than full integration.

fbemm commented 6 years ago

Sorry for the delay on my side as well. I wrote a simple parser that takes the homologs results (I ran homGeneMapping using the augCGP GTFs), subsets them to groups containing only de novo predictions (looking at CAT's final output), and further removes paralogs (could be done with noDupes in the first place). I am now going over them by hand to understand their nature. The next step would be to create proper gene loci (e.g., I thought about hgmDN as a prefix, a species label, and the transcript ID).
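
The steps described above can be sketched roughly as follows. This is not the parser used in this thread. It assumes the --printHomologs file starts with header lines of the form `# <index> <genome>` followed by one group per line written as `(<index>,<transcript>)` tuples; the real layout should be checked against the homGeneMapping help text. The `is_denovo` callback is a hypothetical stand-in for a lookup against CAT's final output.

```python
import re

# One "(genome_index,transcript_id)" tuple; whitespace inside is tolerated.
TUPLE_RE = re.compile(r"\(\s*(\d+)\s*,\s*([^)\s]+)\s*\)")


def parse_homologs(lines):
    """Parse the assumed --printHomologs layout into lists of (genome, transcript)."""
    genomes, groups = {}, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith("#"):
            # Header line mapping a numeric index to a genome name.
            fields = line.lstrip("#").split()
            if len(fields) >= 2 and fields[0].isdigit():
                genomes[int(fields[0])] = fields[1]
            continue
        members = [(genomes.get(int(i), i), tx) for i, tx in TUPLE_RE.findall(line)]
        if members:
            groups.append(members)
    return groups


def novel_single_copy(groups, is_denovo, min_genomes=2):
    """Keep groups that are all de novo, paralog-free, and span enough genomes."""
    kept = []
    for group in groups:
        names = [genome for genome, _ in group]
        if len(set(names)) < len(names):  # a genome appears twice: paralogs
            continue
        if len(names) >= min_genomes and all(is_denovo(g, tx) for g, tx in group):
            kept.append(group)
    return kept


# Made-up example input, for illustration only.
example = "# 0\tdog\n# 1\thuman\n# 2\tmouse\n(0,jg7) (1,jg9) (2,jg11) (2,jg12)\n(0,jg8) (1,jg10)\n(1,jg3)\n"
denovo = {("dog", "jg8"), ("human", "jg10"), ("human", "jg3")}
groups = parse_homologs(example.splitlines())
novel = novel_single_copy(groups, lambda genome, tx: (genome, tx) in denovo)
print(novel)
```

Filtering paralogs after parsing, rather than relying on an upstream flag, keeps the decision visible and easy to adjust.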

ifiddes commented 6 years ago

Nice! One thing to note -- the --noDupes flag does not work as well as it might seem, it is being passed through to halLiftover but @joelarmstrong and I found that it hits some very weird edge cases and generally fails to filter for paralogs. So downstream filtering is advised.

ifiddes commented 6 years ago

Nice! One note: the --noDupes flag does not work as well as advertised. I advise further paralog filtering.

Ian Fiddes

On Feb 6, 2018, at 12:11 AM, Felix Bemm notifications@github.com wrote:

Sorry for the delay on my side as well. I wrote a simple parser that takes the homologs results, subsets them to groups only containing de novo predictions (looking at CATs final output) and further removes paralogs (could be done with noDupes in the first place). Now going over them by hand to understand their nature.


lassancejm commented 6 years ago

As I am about to restart a new round of CAT, I am wondering if you had some time to work on incorporating the --printHomologs flag (which would justify waiting for it to be avail).

Thanks !

ifiddes-10x-zz commented 6 years ago

I haven't had time yet, sorry. Having a job is taking up my time. I do still want to implement this. It wouldn't be too much work to get the basics in there (adding a flag to activate --printHomologs, and a parser, propagating the results of the parser in tags on the output GFF3/genePred). I could probably try to get that done this week. Adding downstream analyses will be quite a bit more work. Would adding the basics be something you want to wait for?
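
As an illustration of what propagating parser results as tags on the output GFF3 could look like, here is a minimal sketch. It is not CAT's implementation; the tag name `homolog_group` and the group identifier are hypothetical.

```python
from urllib.parse import quote


def add_gff3_tag(line, key, value):
    """Append key=value to the attributes column (column 9) of a GFF3 feature line."""
    if line.startswith("#") or not line.strip():
        return line  # leave directives, comments and blank lines untouched
    cols = line.rstrip("\n").split("\t")
    if len(cols) != 9:
        return line  # not a well-formed feature line
    attrs = [] if cols[8] in ("", ".") else cols[8].rstrip(";").split(";")
    # Percent-encode the value so reserved characters such as ';' or '=' stay valid.
    attrs.append(f"{key}={quote(str(value))}")
    cols[8] = ";".join(attrs)
    return "\t".join(cols) + ("\n" if line.endswith("\n") else "")


# Made-up feature line, for illustration only.
record = "chr1\tCAT\ttranscript\t100\t900\t.\t+\t.\tID=jg8;Name=jg8\n"
tagged = add_gff3_tag(record, "homolog_group", "hgmDN_0001")
print(tagged, end="")
```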

lassancejm commented 6 years ago

The 'basics' would definitely help, so I can wait a bit; I may start anyway, as it will take a bit of time to get to the augustus-cgp step, and resume after the update is avail. Thanks for continuing to fine-tune CAT (I understand that having a job means less time for this).

ifiddes-10x-zz commented 6 years ago

Alright, I will work on it. I am glad that the modularity is proving to be useful. You should be able to run the pipeline in its entirety, and then upon my update delete the hgm folder and re-run with --rebuild-consensus and it should update accordingly.

ifiddes-10x-zz commented 6 years ago

I just pushed a commit to a new branch (homologs) that will at least run the mode, and parse the results into the database. Next step is to integrate it with consensus finding.

lassancejm commented 6 years ago

Thanks; I had to restart more or less from scratch, so it may take a bit more than I thought before I can tell if it works fine.