Closed yoavhadas closed 5 months ago
Hi Yoav,
thank you for your continued interest in DTUrtle.
If you already have an aggregated csv file containing transcript counts for all your samples, you can omit the import_counts()
step.
You can see from the help pages of run_drimseq()
which formats for counts data are accepted.
In your case, it would be:
(sparse) matrix with feature counts, where the rows correspond to features (e.g. transcripts). One column per sample/cell with the count data for the specified features must be present (The names of these columns must match with the identifiers in id_col).
So, you can easily read-in and correctly format your csv file, e.g. like this:
# read counts as data.frame
cts <- read.csv("transcript_matrix.csv")
# add TranscriptID as rownames
rownames(cts) <- cts$TranscriptID
# remove TranscriptID-column for matrix conversion
cts$TranscriptID <- NULL
# convert data.frame to matrix
cts <- as.matrix(cts)
Please note: For statistically valid results, DTUrtle relies on actual raw count data. Direct use of abundance estimates (e.g. FPKM/RPKM or TPM etc.) is not recommended.
I hope this solves you problem.
Best, Tobi
Thank you Tobi,
I was able to load the matrix and perform DTU. The results look promising.
Now I try to plot the data, and when I do:
plot_transcripts_view(dturtle = dturtle, genes = "ENSG00000164402.14", gtf = "novel_ref.gtf", genome = 'hg38', one_to_one = NULL)
I get:
Importing gtf file from disk.
Error: subscript contains invalid names
My gtf looks like this:
chr5 PacBio gene 132750136 132807273 . - . gene_id "ENSG00000164402.14";
chr5 PacBio transcript 132750136 132777250 . - . transcript_id "PB.15230.1"; gene_id "ENSG00000164402.14"; transcript_biotype "protein_coding"; gene_name "SEPT8"; sqanti_structural_category "full-splice_match"; gene_type "protein_coding";
chr5 PacBio exon 132750136 132752181 . - . transcript_id "PB.15230.1"; gene_id "ENSG00000164402.14"; gene_name "SEPT8"; transcript_biotype "protein_coding"; sqanti_structural_category "full-splice_match"; gene_type "protein_coding"; exon_number "10";
chr5 PacBio exon 132760802 132760992 . - . transcript_id "PB.15230.1"; gene_id "ENSG00000164402.14"; gene_name "SEPT8"; transcript_biotype "protein_coding"; sqanti_structural_category "full-splice_match"; gene_type "protein_coding"; exon_number "9";
chr5 PacBio exon 132761133 132761265 . - . transcript_id "PB.15230.1"; gene_id "ENSG00000164402.14"; gene_name "SEPT8"; transcript_biotype "protein_coding"; sqanti_structural_category "full-splice_match"; gene_type "protein_coding"; exon_number "8";
chr5 PacBio exon 132761458 132761626 . - . transcript_id "PB.15230.1"; gene_id "ENSG00000164402.14"; gene_name "SEPT8"; transcript_biotype "protein_coding"; sqanti_structural_category "full-splice_match"; gene_type "protein_coding"; exon_number "7";
chr5 PacBio exon 132761800 132761896 . - . transcript_id "PB.15230.1"; gene_id "ENSG00000164402.14"; gene_name "SEPT8"; transcript_biotype "protein_coding"; sqanti_structural_category "full-splice_match"; gene_type "protein_coding"; exon_number "6";
chr5 PacBio exon 132762484 132762645 . - . transcript_id "PB.15230.1"; gene_id "ENSG00000164402.14"; gene_name "SEPT8"; transcript_biotype "protein_coding"; sqanti_structural_category "full-splice_match"; gene_type "protein_coding"; exon_number "5";
chr5 PacBio exon 132763706 132763892 . - . transcript_id "PB.15230.1"; gene_id "ENSG00000164402.14"; gene_name "SEPT8"; transcript_biotype "protein_coding"; sqanti_structural_category "full-splice_match"; gene_type "protein_coding"; exon_number "4";
chr5 PacBio exon 132764224 132764419 . - . transcript_id "PB.15230.1"; gene_id "ENSG00000164402.14"; gene_name "SEPT8"; transcript_biotype "protein_coding"; sqanti_structural_category "full-splice_match"; gene_type "protein_coding"; exon_number "3";
chr5 PacBio exon 132765409 132765529 . - . transcript_id "PB.15230.1"; gene_id "ENSG00000164402.14"; gene_name "SEPT8"; transcript_biotype "protein_coding"; sqanti_structural_category "full-splice_match"; gene_type "protein_coding"; exon_number "2";
chr5 PacBio exon 132777108 132777250 . - . transcript_id "PB.15230.1"; gene_id "ENSG00000164402.14"; gene_name "SEPT8"; transcript_biotype "protein_coding"; sqanti_structural_category "full-splice_match"; gene_type "protein_coding"; exon_number "1";
chr5 PacBio CDS 132752019 132752181 . - 1 transcript_id "PB.15230.1"; gene_id "ENSG00000164402.14"; gene_name "SEPT8"; transcript_biotype "protein_coding"; sqanti_structural_category "full-splice_match"; gene_type "protein_coding";
chr5 PacBio CDS 132760802 132760992 . - 0 transcript_id "PB.15230.1"; gene_id "ENSG00000164402.14"; gene_name "SEPT8"; transcript_biotype "protein_coding"; sqanti_structural_category "full-splice_match"; gene_type "protein_coding";
chr5 PacBio CDS 132761133 132761265 . - 1 transcript_id "PB.15230.1"; gene_id "ENSG00000164402.14"; gene_name "SEPT8"; transcript_biotype "protein_coding"; sqanti_structural_category "full-splice_match"; gene_type "protein_coding";
chr5 PacBio CDS 132761458 132761626 . - 2 transcript_id "PB.15230.1"; gene_id "ENSG00000164402.14"; gene_name "SEPT8"; transcript_biotype "protein_coding"; sqanti_structural_category "full-splice_match"; gene_type "protein_coding";
chr5 PacBio CDS 132761800 132761896 . - 0 transcript_id "PB.15230.1"; gene_id "ENSG00000164402.14"; gene_name "SEPT8"; transcript_biotype "protein_coding"; sqanti_structural_category "full-splice_match"; gene_type "protein_coding";
chr5 PacBio CDS 132762484 132762645 . - 0 transcript_id "PB.15230.1"; gene_id "ENSG00000164402.14"; gene_name "SEPT8"; transcript_biotype "protein_coding"; sqanti_structural_category "full-splice_match"; gene_type "protein_coding";
chr5 PacBio CDS 132763706 132763892 . - 1 transcript_id "PB.15230.1"; gene_id "ENSG00000164402.14"; gene_name "SEPT8"; transcript_biotype "protein_coding"; sqanti_structural_category "full-splice_match"; gene_type "protein_coding";
chr5 PacBio CDS 132764224 132764419 . - 2 transcript_id "PB.15230.1"; gene_id "ENSG00000164402.14"; gene_name "SEPT8"; transcript_biotype "protein_coding"; sqanti_structural_category "full-splice_match"; gene_type "protein_coding";
chr5 PacBio CDS 132765409 132765529 . - 0 transcript_id "PB.15230.1"; gene_id "ENSG00000164402.14"; gene_name "SEPT8"; transcript_biotype "protein_coding"; sqanti_structural_category "full-splice_match"; gene_type "protein_coding";
chr5 PacBio CDS 132777108 132777137 . - 0 transcript_id "PB.15230.1"; gene_id "ENSG00000164402.14"; gene_name "SEPT8"; transcript_biotype "protein_coding"; sqanti_structural_category "full-splice_match"; gene_type "protein_coding";
chr5 PacBio transcript 132750143 132777231 . - . transcript_id "PB.15230.3";
Hi @yoavhadas ,
your GTF file does seem to be a bit malformatted. Please note, that every entry should have a gene_id or gene_name and every transcript or other features also a transcript_id set. (the last column of your column does not satisfy this property).
As a quick check, you can read your gtf_file before the plot_transcripts_view()
call with
my_gtf <- import_gtf("novel_ref.gtf", feature_type = NULL, out_df = FALSE)
and check and modify the gtf till the following preconditions are satisfied:
!anyNA(my_gtf $gene_name)
!anyNA(my_gtf $transcript_id[my_gtf $type != "gene"])
Please note, that these potential issues in your GTF could also affect your initial transcrip quantification - thus you might want to check what initially causes the missing IDs and redo the whole analysis.
I hope this resolves your problem.
Best; Tobi
Hello again, I opened a new issue as a followup to: https://github.com/TobiTekath/DTUrtle/issues/6#issuecomment-1124182002
In my csv file, I have barcodes as columns and transcripts as rows:
TranscriptID,CCGCAGTCAGGACTCA,TCATACCACTTGAGTT,TGCGCGTTAATTACCA,AAGTAGAGTACTAAGT Transcript1,0,0,0,0 Transcript2,0,0,0,0 Transcript3,0,0,3.93,0
I can import it to R using read.csv("transcript_matrix.csv"); however, I am not sure what is the proper structure of the matrix and what you mean by: "specify the columns with the data manually."
Any help will be appreciated. Thanks Yoav.