conJUSTover / pSONIC

The repository serves as a public and official hosting of the pSONIC program (Conover et al., 2021).
GNU General Public License v3.0

gff3 file editing for MCScanX #4

Closed sjfleck closed 2 years ago

sjfleck commented 2 years ago

Hello, I'm interested in using pSONIC on my data. I'm currently editing my gff3 files to match the "Sp## GeneID Start_POS End_POS" format.

For example, in one of the gff3s, I have 7 lines with the same parent ID. Here they are:

scaffold_115 MSU_v1 mRNA 11041 12267 . - . ID=Calam.S003580.1;Name=Calam.S003580.1;Parent=Calam.S003580
scaffold_115 MSU_v1 five_prime_UTR 12212 12267 . - . Parent=Calam.S003580.1
scaffold_115 MSU_v1 exon 12122 12267 . - . Parent=Calam.S003580.1
scaffold_115 MSU_v1 CDS 12122 12211 . - 0 Parent=Calam.S003580.1
scaffold_115 MSU_v1 exon 11041 11761 . - . Parent=Calam.S003580.1
scaffold_115 MSU_v1 CDS 11249 11761 . - 0 Parent=Calam.S003580.1
scaffold_115 MSU_v1 three_prime_UTR 11041 11248 . - . Parent=Calam.S003580.1

These features fall under the same parent ID. I noticed that the sample gff files that you provided don't have multiple lines with the same ID. Should I just extract the mRNA lines and edit them from there? I would compress those 7 lines down to this single line:

Ca115 Calam.S003580.1 11041 12267

Is this what you expected? Thanks!

conJUSTover commented 2 years ago

Yes, just pulling out the mRNA lines works great. Just ensure that the gene names match the gene names in the fasta file exactly. I point this out because you'll notice that the Parent field on the mRNA line doesn't end in ".1", while the Parent field on all the other lines does.
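For anyone following along, here is a minimal sketch of the mRNA-only extraction discussed above. It is not part of pSONIC; the function name `mrna_to_mcscanx`, the `species_prefix` argument, and the rule of taking the trailing digits of the scaffold name as the chromosome number are all assumptions made for illustration. The output is tab-separated, which is also an assumption, so adjust the delimiter if your MCScanX setup expects something else. Note that it takes the gene name from the `ID` attribute of the mRNA line, not from `Parent`, since `Parent` on that line lacks the ".1" suffix.

```python
import re


def mrna_to_mcscanx(gff3_lines, species_prefix):
    """Yield one 'Sp## GeneID Start End' row per mRNA feature in a GFF3."""
    for line in gff3_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and GFF3 comments/directives
        # GFF3 is tab-delimited; fall back to whitespace for pasted text.
        fields = line.split("\t") if "\t" in line else line.split(None, 8)
        if len(fields) < 9 or fields[2] != "mRNA":
            continue  # keep only mRNA features
        seqid, start, end, attributes = fields[0], fields[3], fields[4], fields[8]
        id_match = re.search(r"(?:^|;)ID=([^;]+)", attributes)
        # Assumption: the chromosome/scaffold number is the trailing digits of seqid.
        number_match = re.search(r"(\d+)$", seqid)
        if id_match is None or number_match is None:
            continue  # skip records this sketch cannot name
        yield "\t".join(
            [species_prefix + number_match.group(1), id_match.group(1), start, end]
        )


example = [
    "scaffold_115\tMSU_v1\tmRNA\t11041\t12267\t.\t-\t.\t"
    "ID=Calam.S003580.1;Name=Calam.S003580.1;Parent=Calam.S003580",
    "scaffold_115\tMSU_v1\texon\t12122\t12267\t.\t-\t.\tParent=Calam.S003580.1",
]
for row in mrna_to_mcscanx(example, "Ca"):
    print(row)
```

With the two example records, only the mRNA line survives and is printed as `Ca115`, `Calam.S003580.1`, `11041`, `12267` separated by tabs. Records whose scaffold name does not end in digits are silently skipped here, so a real script would probably want to report them.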

sjfleck commented 2 years ago

Thanks! They do match. It appears in my protein fasta like this:

Calam.S003580.1
MASEELQGSNLQNQAQPPAPVPTTLPQYPEMILIAIEALNEKNDSNKSSISKHIEATYGN
LPPAHSTLLTHHLNRMKSIDQLYFIKNNYLKLDPNAPSRRGRGRPPKPKTSLPPGTVLLP
PCSRGRPPKSHNPIAPRPPLPTKPKATTTAATVSGKKHGRPSKAATPSVTSTPPPAAGGV
PRGRGRPPKVKPAVTASVGA*
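The name check mentioned earlier in the thread can be automated rather than done by eye. The sketch below is a hypothetical helper, not something pSONIC provides: `fasta_ids` and `missing_from_fasta` are made-up names, and it assumes standard FASTA headers that start with `>` and carry the gene name as the first whitespace-delimited token.

```python
def fasta_ids(fasta_lines):
    """Collect the first token of every '>' header line in a FASTA file."""
    ids = set()
    for line in fasta_lines:
        if line.startswith(">"):
            tokens = line[1:].split()
            if tokens:  # ignore headers with no name
                ids.add(tokens[0])
    return ids


def missing_from_fasta(gene_ids, fasta_lines):
    """Return the gene IDs that have no exactly matching FASTA header."""
    return sorted(set(gene_ids) - fasta_ids(fasta_lines))


fasta = [">Calam.S003580.1", "MASEELQGSNLQNQAQPPAPVPTTLPQYPEMILIAIEALNEKNDSNKSSISKHIEATYGN"]
# The second ID lacks the ".1" suffix, so it is reported as missing.
print(missing_from_fasta(["Calam.S003580.1", "Calam.S003580"], fasta))
```

An empty list means every gene name in the edited gff has an exact counterpart in the protein fasta; anything listed would need renaming on one side or the other.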