Sequences categorized into types of pseudogenes

navkahlon240 commented 1 year ago

Hi, thank you for this awesome pipeline for pseudogene analysis. Is there a way to get the FASTA sequences split by category: short, long, fragmented, and intergenic? The log reports the total number of pseudogenes in each category, but I would like to know which nucleotide sequences belong to which one, because I am interested in doing further analysis on the long sequences.

Thanks.

mitchso commented 1 year ago

Hi,

The categorical information for each pseudogene is found in the GFF output file. From there you can identify the locus tags associated with the group of pseudogenes you are interested in analyzing further, and then pull the sequences that correspond to those locus tags from the fasta files.
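For anyone who wants to script this, here is a minimal sketch of the two steps above (collect locus tags from the GFF, then pull the matching FASTA records). It makes several assumptions that you should check against your own files: that the category appears as plain text somewhere in the GFF attribute column, that the attribute key is `locus_tag`, and that each FASTA header starts with the locus tag. The function names and the demo data are hypothetical and are not part of Pseudofinder.

```python
import io


def parse_attributes(column9):
    """Split a GFF3 attribute column ('key=value;key=value') into a dict."""
    attrs = {}
    for field in column9.strip().split(";"):
        if "=" in field:
            key, value = field.split("=", 1)
            attrs[key.strip()] = value.strip()
    return attrs


def locus_tags_matching(gff_handle, keyword, tag_key="locus_tag"):
    """Collect locus tags of features whose attribute column mentions keyword.

    Assumption: the category is written as text in column 9, and the
    identifier is stored under tag_key. Adjust both to match your GFF.
    """
    tags = set()
    for line in gff_handle:
        if line.startswith("#") or not line.strip():
            continue  # skip GFF directives, comments, and blank lines
        columns = line.rstrip("\n").split("\t")
        if len(columns) < 9:
            continue  # not a complete feature line
        if keyword.lower() in columns[8].lower():
            tag = parse_attributes(columns[8]).get(tag_key)
            if tag:
                tags.add(tag)
    return tags


def record_id(header):
    """Return the first whitespace-delimited word of a FASTA header."""
    words = header.split()
    return words[0] if words else ""


def filter_fasta(fasta_handle, wanted_ids):
    """Yield (header, sequence) for records whose ID is in wanted_ids."""
    header, chunks = None, []
    for line in fasta_handle:
        line = line.rstrip("\n")
        if line.startswith(">"):
            if header is not None and record_id(header) in wanted_ids:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line.strip())
    # emit the final record, which has no following '>' line
    if header is not None and record_id(header) in wanted_ids:
        yield header, "".join(chunks)


# Hypothetical in-memory data, only to show the calls; not real output.
demo_gff = io.StringIO(
    "##gff-version 3\n"
    "contig1\texample\tpseudogene\t100\t400\t.\t+\t.\t"
    "locus_tag=demo_0001;note=hypothetical long example\n"
    "contig1\texample\tpseudogene\t900\t990\t.\t-\t.\t"
    "locus_tag=demo_0002;note=hypothetical short example\n"
)
demo_fasta = io.StringIO(
    ">demo_0001 some description\nATGAAA\nTTTTAG\n>demo_0002\nATGTAG\n"
)

tags = locus_tags_matching(demo_gff, "long")
records = list(filter_fasta(demo_fasta, tags))
for header, sequence in records:
    print(f">{header}\n{sequence}")
```

With real files, replace the `io.StringIO` objects with `open("your_output.gff")` and `open("your_output.fasta")`. Note that the keyword test is a plain substring match, so a word like "long" would also match inside other words; tighten it once you know exactly how the category is written in your GFF.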

Hope this helps! Mitch

liamfriar commented 1 year ago

Hi,

I also love the tool. The "Reason(s):" list appears to always be blank when the reason is that the feature was input as a pseudogene. It is still relatively easy to parse because of the pseudogene vs. pseudogene candidate designation in the .gff. I bring it up because when I then called reannotate, it always reported 0 input pseudogenes. Maybe that is just how reannotate works, but I thought it might have something to do with the lack of annotation in the .gff file. Looking at "annotate.py", the pseudogene reason strings are sometimes saved in reason_dict, sometimes as pseudo_reasons, and sometimes as pseudo_candidate_reasons, so maybe these objects aren't all communicating with each other properly?

Thanks again. Great tool!

mitchso commented 1 year ago

Thanks for bringing this to my attention! I'll clean up the labelling and data structure soon. Best, Mitch

filip-husnik / pseudofinder

Sequences categorized into types of pseudogenes #56