LuyiTian / scPipe

a pipeline for single cell RNA-seq data analysis
67 stars 24 forks source link

gencode GFF3 annotation sc_exon_mapping fault #59

Open ssmax1 opened 6 years ago

ssmax1 commented 6 years ago

Hi, After attempting to use GFF3 annotations from gencode with sc_exon_mapping I am getting a segfault during "add gff3 annotation" (see below). Do you recommend a source for GFF3 annotation?

sc_exon_mapping(BAM, paste0(BAM,".xmap"), "refgenome/gencode.vM16.primary_assembly.annotation.gff3", bc_len = 12, UMI_len = 8, fix_chr = FALSE) add gff3 annotation: refgenome/gencode.vM16.primary_assembly.annotation.gff3

caught segfault address 0x771, cause 'memory not mapped'

Traceback: 1: .Call("_scPipe_rcpp_sc_exon_mapping", PACKAGE = "scPipe", inbam, outbam, annofn, am, ge, bc, mb, bc_len, UMI_len, stnd, fix_chr) 2: rcpp_sc_exon_mapping(inbam, outbam, annofn, bam_tags$am, bam_tags$ge, bam_tags$bc, bam_tags$mb, bc_len, UMI_len, stnd, fix_chr) 3: sc_exon_mapping(BAM, paste0(BAM, ".xmap"), "refgenome/gencode.vM16.primary_assembly.annotation.gff3", bc_len = 12, UMI_len = 8, fix_chr = FALSE)

Shians commented 6 years ago

Any GFF3 should be ok to use, segfaults are probably due to something malformed in the BAM. Could you run samtools view {your_bam_name} | head -n 10 for me?

ssmax1 commented 6 years ago

Thanks for the quick reply, here is the head of the bam file -

AGAGTCTACTAT_CACATGGC#ST-E00522:327:HK77CCCXY:3:1101:24312:4016 16 chr6 72385233 255 12S138M 0 0 TTTTTTTTTTTTGATACCAAGAGGTCTTTATTGCCCACCAGCCACCAACAGTTTCCCAGCCACAGACAGGGTCCTCTGGGTCTCTGTAACCCACTCTCAGGACCATGAGATGCAGCTACAGTCCAGCTGTCTCCTCCCAGGGGATCAAGG JFJFJJFJJJFAJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJFFJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJFJFJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJ<JJJJJJJJJJJJJFFFAA NH:i:1 HI:i:1 AS:i:136 nM:i:0 AAGGCTGACACA_CTCGTAAG#ST-E00522:327:HK77CCCXY:3:1101:24332:4016 16 chr13 74172145 255 100M1051N50M 0 0 TGCTTGCATGGGGCCCTGCTCCAGCGTCTCCATGATGGTCTGTTTCATGTCCCGGCTGGCATCTCCTCGCAGTGCCAGCAGTGCACCAATGTGGTCATCTCTGATGTCTGGATATTTGCTGACCAGAGTTGAGACTTCTAGATAGAGCAG FFF-FJJA-JJJJJJFJJJAJJJJJAJFFJJF-JJJJJJJJJJJJFF7<JJJJJJJJJJJJFJJJJJJJJJJJJJJJJJJJJJJFJJJJJFFJJJJJFJJJJJJJJFJJJJJJJJJJJJJJJJJJJJJJJJFJJJJJJJJJJJJJFFFAA NH:i:1 HI:i:1 AS:i:149 nM:i:0 AACATTCGAACG_GAGCATCT#ST-E00522:327:HK77CCCXY:3:1101:24353:4016 16 chr5 143507384 255 103M609N47M 0 0 CCTTCAGCTTCTCAATGGTGTCCTTATCATCCCTAAGATCAAGCTTCGTCCCCACGAGGATGATAGGAGTATTGGGACAGTGGTGTCGCACTTCAGGATACCACTTTGCACGGACATTTTCAAATGATGCAGGACTCACAAGGGAAAAGC JFFJJJJA<FFAJJFJJJAFAJAFF<FJF<JA<JJJJFFJJFF--<AA7JJJJJJJJJJJJJFAJJJJJJJJJJFJJJJFJJJJJJJJJJJJJJFJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJFFFAA NH:i:1 HI:i:1 AS:i:150 nM:i:0 GGCGCTCTGTTT_GTTATAGT#ST-E00522:327:HK77CCCXY:3:1101:24393:4016 0 chr4 134925306 255 127M1302N23M 0 0 ACTAGAAATAGCAAAAGCCAACGCAGCAAAAGCTCTGGGAACAGCCAACTTCGACTTGCCAGCAAGTCTCCGAGCCAAGGAGGCAAGCCAGGGGACAGCTGTTTCCAGCAGTGGGCCAAAGGTGGAGCATTCAGAAAAGCAGACTGAAGA AAFFFJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJFJJJJJJJJJJJJJJJJ<JJJFJJJJJJJJJJJJJAFJJJJJ<<AJJ-FFJJJJJJJJ<AAFJAJJFJJAFJJJFJJF<FJJJFJJ<<<<J NH:i:1 HI:i:1 AS:i:149 nM:i:0 AAAAGGCTCATT_AACGTTTT#ST-E00522:327:HK77CCCXY:3:1101:24413:4016 0 chr7 109519197 255 26M362N64M269N60M 0 0 CCTTTTCCCTTTCTGCCACCGCCATGCCATCCAGACTGAGGAAGACCCGGAAACTCCGGGGCCACGTGAGCCACGGCCACGGCCGCATCGGTAAGCACCGCAAGCACCCAGGCGGCCGCGGGAATGCTGGAGGCATGCACCACCACAGGA AAFFFJJJJJJJJJJJJJJJJJJJJJJJJJJJJJFJJJJJJJJJJJJJJJJJAJFFJJJJJJJJJJJJFJJJJJJJFFAJJJJJJJJJJJJJJJFJFJJJJF7FFJJAJJJJJJJJJJJJJJFFJJJJFJ<JF<-A7FFJJFJJFJFJFF NH:i:1 HI:i:1 AS:i:152 nM:i:0 TAACCCGACCCA_AGCCCAAA#ST-E00522:327:HK77CCCXY:3:1101:24434:4016 0 chr13 49673158 255 150M 0 0 GGAAGGAAGAAAAAGATAGCCTTTTTAGTTTTATACAATTACACCATTGTCATCAATATGCCATTTATTTCAAAGCTTATTGTTTTAGATTTGACAGAATGTAGAGAAAAGTGTAAAAGCAAGTCCTACAATTTTATATCCAATTTCAGG AAFFFJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJFJJJJJJFJ7AJFJJJJJJJJFJJFJFJJJFJJJJJJJFJAJAFJJJJJA<FA NH:i:1 HI:i:1 AS:i:148 nM:i:0 ATCGAATAAGTA_ATGTCCAC#ST-E00522:327:HK77CCCXY:3:1101:24495:4016 0 chr1 150403874 255 150M 0 0 GTATCGAGAAAAACGCTTAGAACAAGAAAAGGAATTGCTACACAATCAAAATTCATGGCTGAACACAGAGTTGAAAACCAAAACTGATGAGCTGTTGGCTCTAGGAAGAGAAAAAGGAAATGAAATTCTGGAACTTAAGTGTAATCTTGA A<FAAJJFFJJJJJJJJFJJFAFAFJFJJJJJJJJJJJJJJJFJJJJJJJJJJFJJ<JJJFJJFJ<JJJJFJJ<FFJJJJJJFFJJJJJJJJJJJJJJJJJJ<FFJJAJFJJJJJAFJFJJJJFJ7JFJJJFFFFAFJJFFFJFJJJJJF NH:i:1 HI:i:1 AS:i:148 nM:i:0 ATCCTTAGACAC_GACGCCGA#ST-E00522:327:HK77CCCXY:3:1101:24515:4016 16 chr12 28633771 255 62M376N88M 0 0 CTGGGGCGCTTCTGCTTATTTTTCGTACGGCTTTTCCGGGTTGGCTTGGGCAGAATCCTCCTCTGAGCAATGAAGACCACGTGCTTGCCACTGAACTTTTTCTCCAATTCACGAACTAGCCGGACTTGGATTTTCTGGAAAGATTTAAGC J<JJJFAJAFJFJJFJFJFF-A7JJJFJJJJAJJJJJJJJJJFJJJJJJJJJJJFFJJJJFJJFJ7JAJJJJJJJJJJJJJJJJFJJJJJJJJJJJJFFJAJFJJJJJJJJJJJJJJJJJJJJJJFFJJJJJJJJJJJJJJJJJJFFFAA NH:i:1 HI:i:1 AS:i:150 nM:i:0 GAGTATCAGTCG_AAGCAATC#ST-E00522:327:HK77CCCXY:3:1101:24535:4016 16 chr5 100562641 255 122M9382N28M 0 0 GAAGCAATCACACAGGCTGGTCTGCCAGTTGGAATTTTGGGGAGCACGAACGAATCCAGGTTGAGGCACGAGAACTGTTGGTGCCTGAGCCATCTTCGATGCGAGGTCTCAAAAGACCAAGTAATTCAAAAGGGTTGGGCAGAACCGGTC <FA<---<JJF7AA<F<-77)7)A77AFFAJAJFJFAAJJFA)FJJAFFJJFFAAF<FJF7-<<A-JJJA--F7A<FFA777A7--F7JAJF-7-7-F<--F<FFA-F--FAAFFA77<JFFA-FFFJJFFFJAF--7-AFA7-A7A-AA NH:i:1 HI:i:1 AS:i:143 nM:i:3 GGAAATTGTCAG_GGCTTTTT#ST-E00522:327:HK77CCCXY:3:1101:24616:4016 16 chr17 33996386 255 91M139N39M172N20M 0 0 TGACTAAAGAGAACTGAGGGCTCTGGATGTCACAGGAGATATGAGAAGACATTGTCTGTCACCAAGTCCACTCCAGGCAGCTGTCTTCACGCTTTACAATCTGGGAGAGACAGATCAGAGGTCTGGGAGCCTGGAGCCAGAGCATAGTCC A<F77JJJJJJF<<<JJJJA<<AA<-FAJFJJJJJJFJJFJJJJFJJAJJJJFJFJJJFJJJFJJJFJJJJFJJFJJFJJJJJJJJFFJFJJJJJJAFJJFFJJJJJJJJJJJJFJJJJJJJJAAJJJJJJJJJJJJJJJJJJJJFFFAA NH:i:1 HI:i:1 AS:i:152 nM:i:0

ssmax1 commented 6 years ago

Hi Shian, I tried the ensembl GFF3 annotation which works, but have noticed that some exons seem to be excluded leaving many barcode matched and aligned but unassigned reads. I suspect it may be exon parent - transcript parent - gene issue.

ensembl gff3 file - Mus_musculus.GRCm38.91.gff3

specific gene exon of interest Setd7 - 3 ensembl_havana exon 51515319 51521487

3 ensembl_havana gene 51515319 51560879 . - . ID=gene:ENSMUSG00000037111;Name=Setd7;biotype=protein_coding;description=SET domain containing (lysine methyltransferase) 7 [Source:MGI Symbol%3BAcc:MGI:1920501];gene_id=ENSMUSG00000037111;havana_gene=OTTMUSG00000017099;havana_version=3;logic_name=ensembl_havana_gene;version=9

3 ensembl_havana mRNA 51515319 51560879 . - . ID=transcript:ENSMUST00000037141;Parent=gene:ENSMUSG00000037111;Name=Setd7-201;biotype=protein_coding;ccdsid=CCDS17341.1;havana_transcript=OTTMUST00000041418;havana_version=2;tag=basic;transcript_id=ENSMUST00000037141;transcript_support_level=1;version=8

3 ensembl_havana exon 51515319 51521487 . - . Parent=transcript:ENSMUST00000037141;Name=ENSMUSE00000454592;constitutive=0;ensembl_end_phase=-1;ensembl_phase=2;exon_id=ENSMUSE00000454592;rank=8;version=6

On 1 March 2018 at 12:08, Shian Su notifications@github.com wrote:

Any GFF3 should be ok to use, segfaults are probably due to something malformed in the BAM. Could you run samtools view {your_bam_name} | head -n 10 for me?

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/LuyiTian/scPipe/issues/59#issuecomment-369439807, or mute the thread https://github.com/notifications/unsubscribe-auth/AjObW-1h9qaEzFbFBIvPeumn4KFQAmXeks5tZ0oYgaJpZM4SXbV- .

-- Research Assistant Epigenetics in Human Health and Disease Laboratory Department of Diabetes | Central Clinical School Monash University | AMREP Campus

Level 5, Alfred Centre 99 Commercial Rd, Melbourne VIC 3004 https://maps.google.com/?q=99+Commercial+Rd,+Melbourne+VIC+3004+Australia&entry=gmail&source=g Australia https://maps.google.com/?q=99+Commercial+Rd,+Melbourne+VIC+3004+Australia&entry=gmail&source=g

T: (03) 99030485 E: scott.maxwell1@monash.edu

Shians commented 6 years ago

Yes the BAM file seems to be fine, I'll have to investigate the GENCODE annotation later.

With regards to reads not being assigned, it is expected that many reads do not match with exons, on particularly good samples we'd still see around 25% of the aligned reads not fall into an annotated region. Roughly 50% of reads end up in an annotated exon.

I wouldn't worry unless your numbers are significantly worse than that. Any gene that does not have any counts will not be recorded in the count matrix (there are no rows with all zeros). If you are certain of a gene's presence you can try this to extract the reads that should have been assigned to a particular exon. If no reads match the the gene is not present in the data.

ssmax1 commented 6 years ago

Thanks for the advice! I have since narrowed the problem down and managed to obtain counts for the exon of interest.

As there are reads that aligned but not assigned to the exon of interest, I examined the GFF3 file from Ensembl. I removed all overlapping features -3' / 5' UTR, CDS, ncRNA etc - leaving only exons and their parent relationships. This has resolved my issue, but indicates a need to provide an additional option in sc_exon_mapping to choose which features of a GFF3 file to include/ exclude from mapping.

Cheers, Scott

Shians commented 6 years ago

Thanks for that, it sounds like we need to have a look at how we parse GFF3 files.

ssmax1 commented 6 years ago

Sorry, I was mistaken before, the reads that should appear in the exon are still not counted downstream.

The summary from sc_exon_mapping indicates "unique map to exon" for these reads but they are not counted in demultiplexing and gene counting as there are no reads in gene_count.csv for the gene ENSMUSG00000037111 / Setd7

On inspection there are 1000's of reads that map to this exon - Setd7 - 3 ensembl_havana exon 51515319 51521487

ssmax1 commented 6 years ago

After further examination... If I remove all annotation from the Ensembl file "Mus_musculus.GRCm38.91.gff3" except for lines matching ENSMUSG00000037111 / Setd7 exons and linked parents - mRNA/ gene lines, the reads are counted and appear in gene_count.csv. So there does appear to be overlapping features in the full GFF3 annotation which need the option to be excluded.

Shians commented 6 years ago

So if I am understanding correctly, the feature you are interested in was overlapping with another feature, and removing the overlapping feature allowed your feature to be counted? Our count assignment strategy probably needs to be updated to handle this, it wouldn't be sensible to exclude overlapping features, since they may be of interest (as in your case).

Instead we will probably work towards providing options to either, not assign counts in overlapped regions, assign counts to both features when overlapped or randomly assign to one of the features if overlapped.

Also it turns out that we don't handle non-ENSEMBL gffs at all, gffs are very inconsistent in their final column and we will put in bit of work to get support going for RefSeq and Gencode.

ssmax1 commented 6 years ago

Hi Shian, Thank you, yes that is correct. I haven't pinned down which overlap was causing the problem, but excluding everything but the exon of interest did allow the reads to be counted. I look forward to your future updates and developments with scPipe!!

Shians commented 6 years ago

I didn't write the GFF parsing function so I've had to figure out the code myself. The current strategy is as follows

for entry in gff:
    if entry has type exon:
        transcript = parent
        gene = transcript_to_gene_dict[transcript]
        add exon to gene // parent of transcript is a gene
    else:
        if entry has ID:
            transcript_to_gene_dict[transcript] = parent

This works under the relatively fragile assumption that an entry with an ID and a parent must be a transcript, this is true for ENSEMBL annotation but not Gencode, where every entry has an ID. That is why Gencode does not parse.

Unfortunately using the type field and looking for "transcript" is extremely unreliable, as there are a dozen different types synonymous for transcript, such as lnc_RNA or aberrant_processed_transcript. Given I can't seem to find any proper documentation of who makes these up, I can't come up with a solution that isn't equally fragile.

HOWEVER, from what I can see the GFF3 parsing works fine for the current release (V91) ENSEMBL annotations. With my test dataset I can find the gene you mentioned. A manual check shows that although your whole gene overlaps with ENSMUSG00000102466, your particular exon should not be overlapping.

For the time being I will work on supporting Gencode annotations, but I cannot figure out why the gene is not reported for you.

Shians commented 6 years ago

I know this kind of data is usually quite sensitive, but it would help a lot if you could provide a reproducible example. Maybe run the pipeline on a dataset that's already public and see if you lose the gene?

Actually what would be really helpful is if you could direct me to exactly where you downloaded your annotation and extracting the relevant reads from your BAM files using samtools view input.bam "Chr3:51515319-51560879" > output.sam

LuyiTian commented 6 years ago

@ssmax1 I think it might relate to a bug in exon counting that has been fixed in the C++ version of scPipe but haven't been fixed in R version yet. If it is not too time-consuming, could you try the C++ version of scPipe and see if you still get the same error? The functions and argument are pretty similar to the R side but in command line tool format. note that you need a newer gcc which support the C++11 to compile and run the code.

ssmax1 commented 6 years ago

Thanks for looking into this and for the suggestions @Shians & @LuyiTian. I am currently re-aligning my data, but will isolate the reads and try the C++ version once complete. This is a link to the GFF3 annotation that I was using- ftp://ftp.ensembl.org/pub/release-91/gff3/mus_musculus/Mus_musculus.GRCm38.91.gff3.gz