Arabidopsis annotation missing transcipt_name

Hoohm / dropSeqPipe

A SingleCell RNASeq pre-processing snakemake workflow

Creative Commons Attribution Share Alike 4.0 International

147 stars 47 forks source link

Arabidopsis annotation missing transcipt_name #109

Open vuzun opened 4 years ago

vuzun commented 4 years ago

Hi,

I encountered an error after the ReduceGTF step in the pipeline - Invalid GTF line and missing transcript_name.

I am running it on Arabidopsis data, so I changed the rule that fetches the annotation to point to ftp://ftp.ensemblgenomes.org/pub/plants/release-47/gtf/arabidopsis_thaliana, but it looks like the Arabidopsis annotation doesn't have the transcript_name field like the human does.

Do you know if there a way to use the id instead of the name somehow? Or anything else that would work?

Thanks for the tool and all the help around here!

seb-mueller commented 4 years ago

Hi vuzu,

I came across the same issue which seems to come down to a flawed ensemble annotation and dropseq-tools requiring transcript_name to generate the refFlat. Apparently they didn't have non human and mouse samples in mind.

I cobbled together a workaround, which is I still have to test The idea is to copy the the gene_id as transcript_name when it is absent. I'm happy to share the full workflow later (have to run now), but find at least the convert snippet below as a start:

cat annotation.gtf |
awk '{  if (!/gene_name/)
        print $0, "gene_name " $10;
        else print $0
        }' |
awk '{  if (!/transcript_id/)
        print $0, "transcript_id " $10;
        else print $0
        }' |
awk '{  if (!/transcript_name/)
        print $0, "transcript_name " $10;
        else print $0
        }' > tmp.gtf
# this is still run by the pipeline?
sed -i 's/[^"];/_/'  tmp.gtf
cp tmp.gtf annotation.gtf

seb-mueller commented 4 years ago

As an update to the above. I've found a lot of genes being excluded because they have the same name. Drop-Seq-tools excludes them by default. It turns out, transcript_name needs to be as unique as possible. In the snippet above, field $10 was used which corresponds to gene_id, however this removes all the isoforms since they share the same gene_id. Using transcript_id in field $12 turns out to prevent that from happening. So ultimately, the following should be used instead:

cat annotation.gtf |
awk '{  if (!/gene_name/)
        print $0, "gene_name " $10;
        else print $0
        }' |
awk '{  if (!/transcript_id/)
        print $0, "transcript_id " $10;
        else print $0
        }' |
awk '{  if (!/transcript_name/)
        print $0, "transcript_name " $12;
        else print $0
        }' > tmp.gtf
# replace names containing semicolons with underscores since they dropseqtools doesn't handle them well:
sed -i 's/[^"];/_/'  tmp.gtf
cp tmp.gtf annotation.gtf