Open jcerca opened 3 years ago
I'm not sure if you ever solved this issue, but for anybody who runs into the same thing: be aware that the Seq_removeGFFaltSplicing.py script looks for both a "longest=" field and a "pacid=" field in column 9 of your GFF file. As far as I can tell, PAC refers to a "Phytozome accession ID", and its purpose here is simply to group the features of each alternative isoform under the same mRNA entry. For each feature (mRNA, exon, CDS, UTRs), the pacid can be set to the "ID" of the parent mRNA entry, and the script should then run properly.
So, to extend the already good solution by @jcerca, this is what I did to solve the issue fully:
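To see whether a GFF is affected before running the pipeline, a quick check along the following lines may help. This is only a sketch: the function names are made up for illustration, and it assumes (based on the description above, not on reading the script itself) that "longest" is needed on mRNA features and "pacid" on mRNA, exon, CDS and UTR features.

```python
def parse_attributes(column9):
    """Split a GFF attribute string such as 'ID=x;Parent=y' into a dict."""
    attrs = {}
    for item in column9.strip().split(";"):
        if "=" in item:
            key, value = item.split("=", 1)
            attrs[key.strip()] = value.strip()
    return attrs


def missing_fields(gff_lines):
    """Count features that lack the fields described in the comment above.

    Assumption: 'longest' is expected on mRNA features, 'pacid' on
    mRNA, exon, CDS and UTR features.
    """
    counts = {"mRNA_without_longest": 0, "feature_without_pacid": 0}
    for line in gff_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip directives, comments and blank lines
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 9:
            continue  # not a complete feature line
        ftype = cols[2]
        attrs = parse_attributes(cols[8])
        if ftype == "mRNA" and "longest" not in attrs:
            counts["mRNA_without_longest"] += 1
        needs_pacid = ftype in ("mRNA", "exon", "CDS") or ftype.endswith("UTR")
        if needs_pacid and "pacid" not in attrs:
            counts["feature_without_pacid"] += 1
    return counts
```

Calling `missing_fields(open("original.gff"))` returns both counts; non-zero values suggest the file needs the extra fields added before the script will accept it.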
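awk -F "\t" -v OFS="\t" '{
# mRNA: insert longest=1 before Parent, then append pacid=<value of the first attribute (ID)>
if($3=="mRNA"){gsub("Parent","longest=1;Parent",$9); split($9,ID,";"); split(ID[1],pac,"="); $9=$9";pacid="pac[2]};
# CDS, exon, UTR: append pacid=<value of the second attribute (Parent)>
if($3=="CDS"||$3=="exon"||$3~/UTR$/){split($9,ID,";"); split(ID[2],pac,"="); $9=$9";pacid="pac[2]}
}1' original.gff > cleaned.gff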
NB: This assumes that "ID" is the first field of column 9 for each mRNA feature, and that "Parent" is the second field for each CDS, exon and UTR feature, as in the example GFF provided above.
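If your GFF does not keep "ID" and "Parent" in those positions, the same idea can be written to look the attributes up by name instead. The Python sketch below is one way to do it, not part of the pipeline. The function name is hypothetical, and it appends "longest=1" at the end of column 9 rather than before "Parent", on the assumption that the script finds the field regardless of where it sits.

```python
def add_longest_and_pacid(gff_lines):
    """Return GFF lines with 'longest' and 'pacid' added to column 9.

    mRNA features get longest=1 and pacid=<their own ID>;
    CDS, exon and UTR features get pacid=<their Parent>.
    Attributes are looked up by key, not by position.
    """
    annotated = []
    for raw in gff_lines:
        line = raw.rstrip("\n")
        cols = line.split("\t")
        if line.startswith("#") or len(cols) < 9:
            annotated.append(line)  # pass comments and short lines through
            continue
        base = cols[8].rstrip(";")  # avoid ';;' if column 9 ends with ';'
        attrs = dict(item.split("=", 1) for item in base.split(";") if "=" in item)
        ftype = cols[2]
        extra = []
        if ftype == "mRNA":
            if "longest" not in attrs:
                extra.append("longest=1")
            if "pacid" not in attrs and "ID" in attrs:
                extra.append("pacid=" + attrs["ID"])
        elif ftype in ("CDS", "exon") or ftype.endswith("UTR"):
            if "pacid" not in attrs and "Parent" in attrs:
                extra.append("pacid=" + attrs["Parent"])
        cols[8] = ";".join([base] + extra)
        annotated.append("\t".join(cols))
    return annotated
```

To use it, read original.gff, pass the lines through `add_longest_and_pacid`, and write the result joined with newlines to cleaned.gff. Note that a feature with several comma-separated parents gets the whole "Parent" value as its pacid, which may not be what the script expects.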
Hi,
I was really impressed with the pipeline you developed and would like to use it for my own work. I managed to install the pipeline and run the test; however, running it on my own data has proven difficult. Here is what I did:
When running the script I get the following error:
Given that my gff does not have "longest=" (since I already selected the longest isoform, assuming that is what is needed here), I did:
This got me the following error: the gff.noAlt is not generated (which I guess is expected and OK).
Thank you for your time. Here's the parameters file, btw:
And the files: