Open vuzun opened 4 years ago
Hi vuzu,
I came across the same issue which seems to come down to a flawed ensemble annotation and dropseq-tools requiring transcript_name
to generate the refFlat. Apparently they didn't have non human and mouse samples in mind.
I cobbled together a workaround, which is I still have to test The idea is to copy the the gene_id as transcript_name when it is absent. I'm happy to share the full workflow later (have to run now), but find at least the convert snippet below as a start:
cat annotation.gtf |
awk '{ if (!/gene_name/)
print $0, "gene_name " $10;
else print $0
}' |
awk '{ if (!/transcript_id/)
print $0, "transcript_id " $10;
else print $0
}' |
awk '{ if (!/transcript_name/)
print $0, "transcript_name " $10;
else print $0
}' > tmp.gtf
# this is still run by the pipeline?
sed -i 's/[^"];/_/' tmp.gtf
cp tmp.gtf annotation.gtf
As an update to the above. I've found a lot of genes being excluded because they have the same name. Drop-Seq-tools excludes them by default. It turns out, transcript_name
needs to be as unique as possible. In the snippet above, field $10 was used which corresponds to gene_id
, however this removes all the isoforms since they share the same gene_id
. Using transcript_id
in field $12 turns out to prevent that from happening. So ultimately, the following should be used instead:
cat annotation.gtf |
awk '{ if (!/gene_name/)
print $0, "gene_name " $10;
else print $0
}' |
awk '{ if (!/transcript_id/)
print $0, "transcript_id " $10;
else print $0
}' |
awk '{ if (!/transcript_name/)
print $0, "transcript_name " $12;
else print $0
}' > tmp.gtf
# replace names containing semicolons with underscores since they dropseqtools doesn't handle them well:
sed -i 's/[^"];/_/' tmp.gtf
cp tmp.gtf annotation.gtf
Hi,
I encountered an error after the ReduceGTF step in the pipeline - Invalid GTF line and missing transcript_name.
I am running it on Arabidopsis data, so I changed the rule that fetches the annotation to point to ftp://ftp.ensemblgenomes.org/pub/plants/release-47/gtf/arabidopsis_thaliana, but it looks like the Arabidopsis annotation doesn't have the transcript_name field like the human does.
Do you know if there a way to use the id instead of the name somehow? Or anything else that would work?
Thanks for the tool and all the help around here!