carolzhou / multiPhATE2

multiPhATE with comparative genomics

gene numbers from "consensus" should be consecutive #35

Closed xvazquezc closed 3 years ago

xvazquezc commented 3 years ago

Hi,

I've been running multiPhATE2 on a number of related phages and I noticed that the consensus output has missing gene numbers. For example, one genome has 276 protein-coding genes, but the gene/protein names go up to chr1_consensus_303_geneCall. This is reflected in all the final files when running with primary_calls='consensus': the protein file, the gff, ...

I'm not sure if this is intended, but I find it odd that the final numbering of the genes is not consecutive.

PS: my guess is that this is related to the fix done for #25, which removes the genes that shouldn't be in the consensus but doesn't address the loss of consecutive numbering.

Cheers,

carolzhou commented 3 years ago

The consensus gene calls are calculated per genome, by comparing all of the results for gene callers on that genome. It is possible for the consensus set to contain more calls than the result set from any given gene caller. As I recall, the consensus set should be numbered consecutively, but if you are seeing non-consecutive numbering, it would be helpful for me to take a look at the results you are getting for one of your genomes. Please post here or email to multiphate@gmail.com. Thank you.
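
To illustrate why a combined set can be larger than any single caller's result set, here is a toy sketch. It assumes that calls are identified by (start, end, strand) and that the combined set is a plain union; the caller names and coordinates are hypothetical, and the actual multiPhATE2 comparison logic may well differ.

```python
# Toy illustration only: hypothetical callers and coordinates,
# and a plain set union standing in for the real comparison step.
calls = {
    "caller_a": {(100, 400, "+"), (500, 900, "+")},
    "caller_b": {(100, 400, "+"), (1000, 1300, "-")},
}

# Combine every distinct call seen by at least one caller.
combined = set().union(*calls.values())

# Size of the largest single-caller result set.
largest_single = max(len(c) for c in calls.values())

print(len(combined), largest_single)  # 3 2
```

Each caller reports two genes, but because each also reports one gene the other misses, the combined set holds three.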

xvazquezc commented 3 years ago

Yes, that's exactly what's happening. I'll send you the files by email.

carolzhou commented 3 years ago

Thank you for sending the files, and for identifying an imperfection in the code! I looked through the consensus output, comparing it to the CGC_results.txt, which confirms that the numbering in the consensus output occasionally skips a number. This happens when one of the gene callers calls a gene that is unique with respect to all of the other callers. It is most likely to happen with PHANOTATE, since that gene caller is more likely to detect a gene that the others do not. Because each consensus gene call's number is unique (though not necessarily consecutive), this should not affect any downstream analyses you might do with the consensus data. I'm going to fix this issue in the next version of multiPhATE2. Thank you for submitting this issue!
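
Until a fixed version is available, a small script along the following lines could be used to find the skipped numbers and to build a mapping to consecutive names. This is only a sketch: the name pattern is inferred from the single example in this thread (chr1_consensus_303_geneCall), numbering is assumed to start at 1, and the helper functions are hypothetical, not part of multiPhATE2.

```python
import re

# Assumed name pattern, inferred from the one example in this thread
# ("chr1_consensus_303_geneCall"); real output may differ.
NAME_RE = re.compile(r"^(?P<prefix>.+_consensus_)(?P<num>\d+)(?P<suffix>_geneCall)$")


def parse_name(name):
    """Split a consensus gene name into (prefix, number, suffix)."""
    match = NAME_RE.match(name)
    if match is None:
        raise ValueError(f"unexpected gene name: {name!r}")
    return match.group("prefix"), int(match.group("num")), match.group("suffix")


def missing_numbers(names):
    """Return the gene numbers skipped between 1 and the highest number seen."""
    seen = {parse_name(name)[1] for name in names}
    if not seen:
        return []
    # Assumption: numbering is meant to start at 1.
    return [n for n in range(1, max(seen) + 1) if n not in seen]


def renumber(names):
    """Map each original name to a consecutively numbered name, keeping the original order."""
    ordered = sorted(names, key=lambda name: parse_name(name)[1])
    mapping = {}
    for new_number, name in enumerate(ordered, start=1):
        prefix, _, suffix = parse_name(name)
        mapping[name] = f"{prefix}{new_number}{suffix}"
    return mapping


names = [
    "chr1_consensus_1_geneCall",
    "chr1_consensus_2_geneCall",
    "chr1_consensus_4_geneCall",
]
print(missing_numbers(names))                        # [3]
print(renumber(names)["chr1_consensus_4_geneCall"])  # chr1_consensus_3_geneCall
```

The same mapping would have to be applied to every output file that carries the gene names (the protein file, the gff, and so on) so that the identifiers stay in sync.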