epi2me-labs / wf-single-cell

Only unknown transcripts #140

Open HenriettaHolze opened 1 week ago

HenriettaHolze commented 1 week ago

Ask away!

Hi, I would like to look at transcript usage, but all transcripts are "unknown". The file transcript_processed_feature_bc_matrix/features.tsv.gz contains only "unknown" transcripts:

unknown_00000   ENST00000000233 Gene Expression
unknown_00001   ENST00000000412 Gene Expression
unknown_00002   ENST00000000442 Gene Expression
unknown_00003   ENST00000001008 Gene Expression
unknown_00004   ENST00000002125 Gene Expression

The transcriptome.gff.gz looks as follows:

chr3    StringTie       transcript      9789378 9792721 .       +       .       transcript_id "chr3.stringtie.38.1"; gene_id "chr3.stringtie.38"; gene_name "ARPC4"; xloc "XLOC_000017"; ref_gene_id "ENSG00000241553"; cmp_ref "ENST00000498623"; class_code "o"; tss_id "TSS21";
chr3    StringTie       exon    9789378 9789963 .       +       .       transcript_id "chr3.stringtie.38.1"; gene_id "chr3.stringtie.38"; exon_number "1";
chr3    StringTie       exon    9791260 9791487 .       +       .       transcript_id "chr3.stringtie.38.1"; gene_id "chr3.stringtie.38"; exon_number "2";
chr3    StringTie       exon    9792209 9792721 .       +       .       transcript_id "chr3.stringtie.38.1"; gene_id "chr3.stringtie.38"; exon_number "3";

The pipeline takes a gtf file with transcript information as input.
Does this output mean that none of the transcripts in the GFF file were recovered, or does the pipeline not match transcripts identified by StringTie to known transcripts? Or is there an option to skip StringTie and only quantify the known transcripts given in the GTF file?

Cheers,
Henrietta

HenriettaHolze commented 1 week ago

Never mind, I clearly just can't read.
The known transcript IDs are given in the second column of the features.tsv.gz file; I only need to match them to the genes.
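For anyone else who ends up here, a minimal sketch of that matching step might look like the following. It assumes the reference GTF carries `transcript_id`, `gene_id` and `gene_name` attributes (Ensembl/GENCODE style) and that the ENST ID sits in the second column of features.tsv.gz, as in the excerpt above. The function names and the toy GTF records (including the gene names) are made up for illustration and are not part of the workflow.

```python
import re

# Matches GTF-style attributes such as: transcript_id "ENST00000000233";
ATTR_RE = re.compile(r'(\w+) "([^"]*)"')


def strip_version(tx_id):
    """Drop a trailing '.N' version, but only from Ensembl-style IDs."""
    if tx_id.startswith("ENS") and "." in tx_id:
        return tx_id.split(".", 1)[0]
    return tx_id


def transcript_to_gene(gtf_lines):
    """Map transcript ID -> gene name (falling back to gene_id) from GTF lines."""
    mapping = {}
    for line in gtf_lines:
        cols = line.rstrip("\n").split("\t")
        if line.startswith("#") or len(cols) < 9 or cols[2] != "transcript":
            continue
        attrs = dict(ATTR_RE.findall(cols[8]))
        tx_id = attrs.get("transcript_id")
        if tx_id is None:
            continue
        mapping[strip_version(tx_id)] = attrs.get("gene_name") or attrs.get("gene_id", "")
    return mapping


def annotate_features(feature_lines, tx2gene):
    """Return (feature name, transcript ID, gene) for each features.tsv row."""
    rows = []
    for line in feature_lines:
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 2:
            continue
        tx_id = strip_version(cols[1])
        rows.append((cols[0], tx_id, tx2gene.get(tx_id, "NA")))
    return rows


# Toy reference GTF: coordinates, gene IDs and gene names are invented.
toy_gtf = [
    "\t".join(["chr7", "toy", "transcript", "100", "900", ".", "-", ".",
               'gene_id "ENSG_TOY_1"; transcript_id "ENST00000000233.10"; gene_name "GENE_A";']),
    "\t".join(["chr7", "toy", "exon", "100", "300", ".", "-", ".",
               'gene_id "ENSG_TOY_1"; transcript_id "ENST00000000233.10";']),
    "\t".join(["chr6", "toy", "transcript", "50", "700", ".", "+", ".",
               'gene_id "ENSG_TOY_2"; transcript_id "ENST00000000412";']),
]

# Feature rows in the same three-column layout as the excerpt above.
toy_features = [
    "unknown_00000\tENST00000000233\tGene Expression",
    "unknown_00001\tENST00000000412\tGene Expression",
    "unknown_00002\tENST00000000442\tGene Expression",
]

tx2gene = transcript_to_gene(toy_gtf)
annotated = annotate_features(toy_features, tx2gene)
for row in annotated:
    print("\t".join(row))
```

With real files, the line lists can be replaced by file handles opened with `gzip.open(path, "rt")` (or plain `open` for an uncompressed GTF). Transcripts missing from the reference are reported as "NA" rather than dropped, so mismatches between the two files stay visible.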

Different question now though: all features in transcript_raw_feature_bc_matrix/features.tsv.gz and transcript_processed_feature_bc_matrix/features.tsv.gz have an ENST... ID, but transcriptome.gff.gz also contains transcripts without the cmp_ref attribute. Are the novel transcripts not quantified per cell?
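To make the comparison concrete, here is a rough sketch of how the two files could be checked against each other. It splits the `transcript` records of transcriptome.gff.gz into those with and without a `cmp_ref` attribute and then looks up which IDs appear in the feature list. The function name, the second GFF record (`chr3.stringtie.99.1`) and the feature ID set are hypothetical and only there to exercise the code; the first record is the one quoted earlier in this thread.

```python
import re

# Matches GTF/GFF2-style attributes such as: cmp_ref "ENST00000498623";
ATTR_RE = re.compile(r'(\w+) "([^"]*)"')


def split_by_cmp_ref(gff_lines):
    """Split transcript records into {transcript_id: cmp_ref} and a list without cmp_ref."""
    with_ref = {}
    without_ref = []
    for line in gff_lines:
        cols = line.rstrip("\n").split("\t")
        if line.startswith("#") or len(cols) < 9 or cols[2] != "transcript":
            continue
        attrs = dict(ATTR_RE.findall(cols[8]))
        tx_id = attrs.get("transcript_id")
        if tx_id is None:
            continue
        if "cmp_ref" in attrs:
            with_ref[tx_id] = attrs["cmp_ref"]
        else:
            without_ref.append(tx_id)
    return with_ref, without_ref


toy_gff = [
    # Record quoted above (tabs restored between the nine columns).
    "\t".join(["chr3", "StringTie", "transcript", "9789378", "9792721", ".", "+", ".",
               'transcript_id "chr3.stringtie.38.1"; gene_id "chr3.stringtie.38"; '
               'gene_name "ARPC4"; xloc "XLOC_000017"; ref_gene_id "ENSG00000241553"; '
               'cmp_ref "ENST00000498623"; class_code "o"; tss_id "TSS21";']),
    "\t".join(["chr3", "StringTie", "exon", "9789378", "9789963", ".", "+", ".",
               'transcript_id "chr3.stringtie.38.1"; gene_id "chr3.stringtie.38"; exon_number "1";']),
    # Hypothetical record without cmp_ref, invented for this example.
    "\t".join(["chr3", "StringTie", "transcript", "100", "900", ".", "+", ".",
               'transcript_id "chr3.stringtie.99.1"; gene_id "chr3.stringtie.99";']),
]

# Hypothetical set of IDs taken from column 2 of features.tsv.gz.
feature_ids = {"ENST00000498623"}

with_ref, without_ref = split_by_cmp_ref(toy_gff)
ref_in_features = sorted(ref for ref in with_ref.values() if ref in feature_ids)
novel_in_features = [tx for tx in without_ref if tx in feature_ids]

print("transcripts with cmp_ref:", len(with_ref))
print("transcripts without cmp_ref:", len(without_ref))
print("cmp_ref IDs present in features:", ref_in_features)
print("cmp_ref-less transcripts present in features:", novel_in_features)
```

If the last list stays empty on the real data, that would match what I am seeing, namely that only transcripts tied to a reference ID end up in the matrices.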

Cheers