Figure out why there are duplicate Orthogroups and how to manage that

mezarque commented 2 years ago

Going through the analysis pipeline, after generating Orthogroups and converting the original gxc matrices to exc matrices, it turns out that some Orthogroups end up with exactly the same expression values. This suggests that some genes might belong to multiple Orthogroups. This could be an issue downstream: if two datasets share such Orthogroups but each lists them under a different ID, those Orthogroups could get filtered out and become unavailable for multiple-species comparison. Should investigate this further to see whether it's really a problem and how to deal with it.
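A rough way to check both symptoms might look like the sketch below. It is not the pipeline's actual code: the function names, the dict-based inputs, and the toy IDs are all made up for illustration, and it assumes the exc matrix and the Orthogroup membership table have already been loaded into plain Python dicts.

```python
from collections import defaultdict


def find_duplicate_rows(matrix):
    """Group Orthogroup IDs whose expression vectors are exactly identical.

    `matrix` (hypothetical layout) maps an Orthogroup ID to its list of
    expression values, one value per cell.
    """
    by_values = defaultdict(list)
    for og_id, values in matrix.items():
        # Tuples are hashable, so identical rows land under the same key.
        by_values[tuple(values)].append(og_id)
    return [sorted(ids) for ids in by_values.values() if len(ids) > 1]


def genes_in_multiple_orthogroups(orthogroups):
    """Return genes that are listed as members of more than one Orthogroup.

    `orthogroups` (hypothetical layout) maps an Orthogroup ID to its gene IDs.
    """
    by_gene = defaultdict(set)
    for og_id, genes in orthogroups.items():
        for gene in genes:
            by_gene[gene].add(og_id)
    return {gene: sorted(ogs) for gene, ogs in by_gene.items() if len(ogs) > 1}


# Toy data, invented for this example only.
exc = {
    "OG0000001": [0, 3, 5],
    "OG0000002": [1, 0, 2],
    "OG0000003": [0, 3, 5],
}
og_members = {
    "OG0000001": ["geneA"],
    "OG0000002": ["geneB"],
    "OG0000003": ["geneA"],
}

duplicates = find_duplicate_rows(exc)
shared = genes_in_multiple_orthogroups(og_members)
print(duplicates)
print(shared)
```

If the duplicated rows line up with the genes reported by the second function, that would support the idea that shared gene membership is what produces the identical expression values.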

mezarque commented 1 year ago

So, it turns out that I've been running OrthoFinder (OF) on all transcript isoforms, which could be contributing to this weird behavior.

I'm planning to update the analysis by rerunning OF using only the longest ORF per transcript. TransDecoder has nice functionality for this in this script, but the version provided by conda/mamba doesn't yet include the bugfix described in this commit.

I've manually substituted the \([+-]\) Perl regex with the new .* by modifying the script directly. Recording this manual change here on GitHub in case someone needs to know about it in the future.
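To show why the substitution matters, here is a small sketch of how the two regex fragments behave. It uses Python's re module rather than Perl, and the header strings are invented for the example; they are not the actual TransDecoder header format, and the full pattern in the script is not reproduced here.

```python
import re

# Old fragment: only matches a literal "(+)" or "(-)".
strict = re.compile(r"\([+-]\)")
# New fragment: matches any run of characters, including an empty one.
relaxed = re.compile(r".*")

# Hypothetical header lines, made up for illustration.
headers = [
    "tx1.p1 len:120 (+) tx1:1-360",  # has an explicit strand marker
    "tx2.p1 len:95 tx2:10-295",      # no strand marker at all
]

results = {
    header: (bool(strict.search(header)), bool(relaxed.search(header)))
    for header in headers
}

for header, (strict_ok, relaxed_ok) in results.items():
    print(f"{header!r}: strict={strict_ok}, relaxed={relaxed_ok}")
```

The strict fragment fails on any line that lacks the strand marker in exactly that form, while .* accepts whatever sits in that position.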

mezarque commented 1 year ago

Mentioning get_longest_ORF_per_transcript.pl to make this info a bit more searchable for later

Arcadia-Science / glial-origins #19