Hoohm / dropSeqPipe

A SingleCell RNASeq pre-processing snakemake workflow
Creative Commons Attribution Share Alike 4.0 International
147 stars 47 forks source link

Problem creating refFlat #78

Closed Hofphi closed 5 years ago

Hofphi commented 5 years ago

Hello everyone,

today I ran into some problems in generating the metadata for the single-cell pipeline. I got an unclear error message in the create_refFlat rule when running dropSeqPipe on a Gallus gallus (chicken) genome build from ensemble (release 95).

I went back to dropseqtools and executed the ConvertToRefFlat script manually in the command line. After that I realised that my gtf was missing the attributes "gene_name" and "transcript_id" in some rows. I saw that @Hoohm had posted a similar issue for Arabidopsis in the dropseq google group. Additionally I found an answer from one of the developers of dropseqtools who suggested to add a gene_name field that has the same values as the gene_id and do the same for trascript_name as well!

Does anybody have a suggestion of how to achieve this task using awk or similar command line based tools?

here is a truncated example of the file:

#!genome-build GRCg6a
#!genome-version GRCg6a
#!genome-date 2018-03
#!genome-build-accession NCBI:GCA_000002315.5
#!genebuild-last-updated 2018-03
1   ensembl gene    5273    10061   .   -   .   gene_id "ENSGALG00000054818"; gene_version "1"; gene_source "ensembl"; gene_biotype "protein_coding";
1   ensembl transcript  5273    10061   .   -   .   gene_id "ENSGALG00000054818"; gene_version "1"; transcript_id "ENSGALT00000098984"; transcript_version "1"; gene_source "ensembl"; gene_biotype "protein_coding"; transcript_source "ensembl"; transcript_biotype "protein_coding";
1   ensembl exon    9774    10061   .   -   .   gene_id "ENSGALG00000054818"; gene_version "1"; transcript_id "ENSGALT00000098984"; transcript_version "1"; exon_number "1"; gene_source "ensembl"; gene_biotype "protein_coding"; transcript_source "ensembl"; transcript_biotype "protein_coding"; exon_id "ENSGALE00000484266"; exon_version "1";
1   ensembl exon    9314    9364    .   -   .   gene_id "ENSGALG00000054818"; gene_version "1"; transcript_id "ENSGALT00000098984"; transcript_version "1"; exon_number "2"; gene_source "ensembl"; gene_biotype "protein_coding"; transcript_source "ensembl"; transcript_biotype "protein_coding"; exon_id "ENSGALE00000485748"; exon_version "1";
1   ensembl exon    8674    8904    .   -   .   gene_id "ENSGALG00000054818"; gene_version "1"; transcript_id "ENSGALT00000098984"; transcript_version "1"; exon_number "3"; gene_source "ensembl"; gene_biotype "protein_coding"; transcript_source "ensembl"; transcript_biotype "protein_coding"; exon_id "ENSGALE00000464173"; exon_version "1";

When I was looking through the generate_meta rule I also noticed that there was an additional step of curating the annotation file ("rule_curate_annotation") which does not appear in dropseqtools. Can anyone explain to me what this function does and why you implemented this step in the pipeline?

I would appreciate your help with this.

Cheers, Philipp

Hoohm commented 5 years ago

Hey @Hofphi

As a matter of fact I did have to do the same for ERCCs I added to our references. Here is the command I used: awk -F'[\t|;]' '{printf $0" "; gsub(/id/,"name"); print $9";"$10"; exon_version \"1\";"}'

It might not work out of the box for you but at least you get a template for what you need.

For your question regarding the curate_annotation rule, this is linked to the gtf_biotypes.yaml file that is needed to run the pipeline now.

The default file has all the biotypes and will hence not change your annotation. What you can do now with this new feature is delete certain biotypes that you don't want to map. This should decrease a bit the multimapping and transfer them to uniquely mapped, at least in theory.

I added this because I wanted to compare mapping rates and gene counts with cellRanger from 10x where they use a pretty small subset of the existing biotypes.

Cheers

Hofphi commented 5 years ago

Hey,

thank four for clarifying the curate_annotation rule.

I tried your command for fixing the annotation-file with some variation but unfortunately it was not doing what I was looking for. However, I used the following:

awk '{
        if (!/gene_name/)
        print $0, "gene_name " $10;

        if (!/transcript_id/)
        print $0, "transcript_id " $10;

        else print $0
        }'  directory/FILE.gtf > directory/FILE_corrected.gtf

It works fine but I have to run this script twice since the action "print" (in case the first condition is TRUE) skips the second if condition. Do you have any suggestions how to fix this issue?

Since you already provide a script to automatically download the genome build I was thinking that you could maybe implement a simple fix for the annotation file in the generate_meta rule or the download_meta_single rule.

MarleyCodes commented 5 years ago

I'm trying to get the pipeline running with ftp://ftp.ensemblgenomes.org/pub/plants/release-42/gtf/zea_mays/Zea_mays.B73_RefGen_v4.42.gtf.gz but the pipeline crashes during the ReduceGtf and ConvertToRefFlat steps. I tried both of your fixes, but they haven't worked. Running STAR genomeGenerate works fine for the above file and genome.fasta btw. So it should adhere to gtf standards. Could you send me an example of a functioning gtf? I would get a better understanding of what's "wrong" with my file then.

Hofphi commented 5 years ago

For the dropseq-tools ConvertToRefFlat and ReduceGTF to run properly you will need to provide the following attributes for all rows/annotation in the gtf-file:

You should check if you have these attributes in your file.

Hoohm commented 5 years ago

I actually had to add this exon_version 1 as well. I tried with ERCCS added to the gtf and it would not work without this.

MarleyCodes commented 5 years ago

Thanks for both of your replies. I'll try to work this out today and see if it works.

EDIT: Your suggestions worked and the pipeline no longer crashes on the gtf file, but it does crash on the *plot_violine.R step:

cannot open file '/data/sag/2019/MYE_18020068_2019_SingleCellTomBras/dropSeqPipe/batch1/.snakemake/scripts/tmpv3_g5lck.plot_violine.R': No such file or directory

If this isn't the right place for this error I'll open a new issue.