Transcripts matrix - Githubissues

yoavhadas commented 1 year ago

Hello again, I opened a new issue as a followup to: https://github.com/TobiTekath/DTUrtle/issues/6#issuecomment-1124182002

In my csv file, I have barcodes as columns and transcripts as rows:

TranscriptID,CCGCAGTCAGGACTCA,TCATACCACTTGAGTT,TGCGCGTTAATTACCA,AAGTAGAGTACTAAGT Transcript1,0,0,0,0 Transcript2,0,0,0,0 Transcript3,0,0,3.93,0

I can import it to R using read.csv("transcript_matrix.csv"); however, I am not sure what is the proper structure of the matrix and what you mean by: "specify the columns with the data manually."

Any help will be appreciated. Thanks Yoav.

TobiTekath commented 12 months ago

Hi Yoav,

thank you for your continued interest in DTUrtle.

If you already have an aggregated csv file containing transcript counts for all your samples, you can omit the import_counts() step.

You can see from the help pages of run_drimseq() which formats for counts data are accepted. In your case, it would be:

(sparse) matrix with feature counts, where the rows correspond to features (e.g. transcripts). One column per sample/cell with the count data for the specified features must be present (The names of these columns must match with the identifiers in id_col).

So, you can easily read-in and correctly format your csv file, e.g. like this:

# read counts as data.frame
cts <- read.csv("transcript_matrix.csv")
# add TranscriptID as rownames
rownames(cts) <- cts$TranscriptID
# remove TranscriptID-column for matrix conversion
cts$TranscriptID <- NULL
# convert data.frame to matrix
cts <- as.matrix(cts)

Please note: For statistically valid results, DTUrtle relies on actual raw count data. Direct use of abundance estimates (e.g. FPKM/RPKM or TPM etc.) is not recommended.

I hope this solves you problem.

Best, Tobi

yoavhadas commented 12 months ago

Thank you Tobi, I was able to load the matrix and perform DTU. The results look promising. Now I try to plot the data, and when I do: plot_transcripts_view(dturtle = dturtle, genes = "ENSG00000164402.14", gtf = "novel_ref.gtf", genome = 'hg38', one_to_one = NULL) I get:

Importing gtf file from disk.
Error: subscript contains invalid names

My gtf looks like this:

chr5    PacBio  gene    132750136   132807273   .   -   .   gene_id "ENSG00000164402.14";
chr5    PacBio  transcript  132750136   132777250   .   -   .   transcript_id "PB.15230.1"; gene_id "ENSG00000164402.14"; transcript_biotype "protein_coding"; gene_name "SEPT8"; sqanti_structural_category "full-splice_match"; gene_type "protein_coding";
chr5    PacBio  exon    132750136   132752181   .   -   .   transcript_id "PB.15230.1"; gene_id "ENSG00000164402.14"; gene_name "SEPT8"; transcript_biotype "protein_coding"; sqanti_structural_category "full-splice_match"; gene_type "protein_coding"; exon_number "10";
chr5    PacBio  exon    132760802   132760992   .   -   .   transcript_id "PB.15230.1"; gene_id "ENSG00000164402.14"; gene_name "SEPT8"; transcript_biotype "protein_coding"; sqanti_structural_category "full-splice_match"; gene_type "protein_coding"; exon_number "9";
chr5    PacBio  exon    132761133   132761265   .   -   .   transcript_id "PB.15230.1"; gene_id "ENSG00000164402.14"; gene_name "SEPT8"; transcript_biotype "protein_coding"; sqanti_structural_category "full-splice_match"; gene_type "protein_coding"; exon_number "8";
chr5    PacBio  exon    132761458   132761626   .   -   .   transcript_id "PB.15230.1"; gene_id "ENSG00000164402.14"; gene_name "SEPT8"; transcript_biotype "protein_coding"; sqanti_structural_category "full-splice_match"; gene_type "protein_coding"; exon_number "7";
chr5    PacBio  exon    132761800   132761896   .   -   .   transcript_id "PB.15230.1"; gene_id "ENSG00000164402.14"; gene_name "SEPT8"; transcript_biotype "protein_coding"; sqanti_structural_category "full-splice_match"; gene_type "protein_coding"; exon_number "6";
chr5    PacBio  exon    132762484   132762645   .   -   .   transcript_id "PB.15230.1"; gene_id "ENSG00000164402.14"; gene_name "SEPT8"; transcript_biotype "protein_coding"; sqanti_structural_category "full-splice_match"; gene_type "protein_coding"; exon_number "5";
chr5    PacBio  exon    132763706   132763892   .   -   .   transcript_id "PB.15230.1"; gene_id "ENSG00000164402.14"; gene_name "SEPT8"; transcript_biotype "protein_coding"; sqanti_structural_category "full-splice_match"; gene_type "protein_coding"; exon_number "4";
chr5    PacBio  exon    132764224   132764419   .   -   .   transcript_id "PB.15230.1"; gene_id "ENSG00000164402.14"; gene_name "SEPT8"; transcript_biotype "protein_coding"; sqanti_structural_category "full-splice_match"; gene_type "protein_coding"; exon_number "3";
chr5    PacBio  exon    132765409   132765529   .   -   .   transcript_id "PB.15230.1"; gene_id "ENSG00000164402.14"; gene_name "SEPT8"; transcript_biotype "protein_coding"; sqanti_structural_category "full-splice_match"; gene_type "protein_coding"; exon_number "2";
chr5    PacBio  exon    132777108   132777250   .   -   .   transcript_id "PB.15230.1"; gene_id "ENSG00000164402.14"; gene_name "SEPT8"; transcript_biotype "protein_coding"; sqanti_structural_category "full-splice_match"; gene_type "protein_coding"; exon_number "1";
chr5    PacBio  CDS 132752019   132752181   .   -   1   transcript_id "PB.15230.1"; gene_id "ENSG00000164402.14"; gene_name "SEPT8"; transcript_biotype "protein_coding"; sqanti_structural_category "full-splice_match"; gene_type "protein_coding";
chr5    PacBio  CDS 132760802   132760992   .   -   0   transcript_id "PB.15230.1"; gene_id "ENSG00000164402.14"; gene_name "SEPT8"; transcript_biotype "protein_coding"; sqanti_structural_category "full-splice_match"; gene_type "protein_coding";
chr5    PacBio  CDS 132761133   132761265   .   -   1   transcript_id "PB.15230.1"; gene_id "ENSG00000164402.14"; gene_name "SEPT8"; transcript_biotype "protein_coding"; sqanti_structural_category "full-splice_match"; gene_type "protein_coding";
chr5    PacBio  CDS 132761458   132761626   .   -   2   transcript_id "PB.15230.1"; gene_id "ENSG00000164402.14"; gene_name "SEPT8"; transcript_biotype "protein_coding"; sqanti_structural_category "full-splice_match"; gene_type "protein_coding";
chr5    PacBio  CDS 132761800   132761896   .   -   0   transcript_id "PB.15230.1"; gene_id "ENSG00000164402.14"; gene_name "SEPT8"; transcript_biotype "protein_coding"; sqanti_structural_category "full-splice_match"; gene_type "protein_coding";
chr5    PacBio  CDS 132762484   132762645   .   -   0   transcript_id "PB.15230.1"; gene_id "ENSG00000164402.14"; gene_name "SEPT8"; transcript_biotype "protein_coding"; sqanti_structural_category "full-splice_match"; gene_type "protein_coding";
chr5    PacBio  CDS 132763706   132763892   .   -   1   transcript_id "PB.15230.1"; gene_id "ENSG00000164402.14"; gene_name "SEPT8"; transcript_biotype "protein_coding"; sqanti_structural_category "full-splice_match"; gene_type "protein_coding";
chr5    PacBio  CDS 132764224   132764419   .   -   2   transcript_id "PB.15230.1"; gene_id "ENSG00000164402.14"; gene_name "SEPT8"; transcript_biotype "protein_coding"; sqanti_structural_category "full-splice_match"; gene_type "protein_coding";
chr5    PacBio  CDS 132765409   132765529   .   -   0   transcript_id "PB.15230.1"; gene_id "ENSG00000164402.14"; gene_name "SEPT8"; transcript_biotype "protein_coding"; sqanti_structural_category "full-splice_match"; gene_type "protein_coding";
chr5    PacBio  CDS 132777108   132777137   .   -   0   transcript_id "PB.15230.1"; gene_id "ENSG00000164402.14"; gene_name "SEPT8"; transcript_biotype "protein_coding"; sqanti_structural_category "full-splice_match"; gene_type "protein_coding";
chr5    PacBio  transcript  132750143   132777231   .   -   .   transcript_id "PB.15230.3";

TobiTekath commented 11 months ago

Hi @yoavhadas ,

your GTF file does seem to be a bit malformatted. Please note, that every entry should have a gene_id or gene_name and every transcript or other features also a transcript_id set. (the last column of your column does not satisfy this property).

As a quick check, you can read your gtf_file before the plot_transcripts_view() call with

my_gtf <- import_gtf("novel_ref.gtf", feature_type = NULL, out_df = FALSE)

and check and modify the gtf till the following preconditions are satisfied:

!anyNA(my_gtf $gene_name)

!anyNA(my_gtf $transcript_id[my_gtf $type != "gene"])

Please note, that these potential issues in your GTF could also affect your initial transcrip quantification - thus you might want to check what initially causes the missing IDs and redo the whole analysis.

I hope this resolves your problem.

Best; Tobi

yoavhadas commented 11 months ago

I was able to fix the GTF file using AGAT. Thanks

TobiTekath / DTUrtle

Transcripts matrix #15