tkzeng commented 5 years ago

Hi,

With 60,000 transcripts in transcripts_to_keep, I am unable to get past the create_data step (I have tested up to 36 hours, with 8 cores and 56 GB ram). For a different dataset with 44,500 transcripts, everything completed in ~13 hours. Should this difference in number of transcripts increase the runtime so much, and is there anything I can do to fix this?

EDIT: Everything runs in ~2 hours when using salmon in the alignment-based mode, regardless of the number of transcripts.

SimoneTiberi commented 4 years ago

Hi @tkzeng , apologies for the very late reply: I missed your issue. Please ping me with an email too in case this happens again.

create_data can be slow for big data-sets, particularly when reads are compatible with transcripts from distinct genes and many genes are connected by some reads. This issue happens more frequently when mapping reads with pseudo-aligners (e.g., salmon/kallisto); that's why we suggest to do what you did: align reads to the genome first (e.g., with STAR) and use salmon in alignment-based mode to compute equivalence classes.

However, 13 or 36 h are a lot! I think I never went above 1 h...a full differential pipeline for 6 vs 6 comparison on human data should take ~2 h on a laptop (aligning reads to the genome as you did). If your data is publicly available I'd like to test it to investigate internally what the issue is when running create_data.

Thanks for your message.

Ci-TJ commented 4 years ago

@SimoneTiberi Hi! I also encountered this problem, although I set n_cores=24.

~/RNA-seq/Analysis/GSE112055/salmon_out]$ cat ../Run_GSE112055
SRR6868519
SRR6868520
SRR6868521
SRR6868522
SRR6868523
SRR6868524
SRR6868525
SRR6868526
SRR6868527
SRR6868528
~/RNA-seq/Analysis/GSE112055/salmon_out]$ cat quant_salmon.sh
for line in $(cat ../Run_GSE112055)
do
        salmon quant -i /home2/ymwang/linqin/RNA-seq/fasta/release99/mouse_rna_salmon_k31_index  -l A -1 ../download_GSE112055/$line".sra_1.fastq" -2 ../download_GSE112055/$line".sra_2.fastq" -p 24 --validateMappings -o quant_out/$line --seqBias --gcBias --dumpEq && echo $line Done!
done
# quant with salmon
~RNA-seq/Analysis/GSE112055/salmon_out]$ nohup sh quant_salmon.sh &

~/linqin/RNA-seq/Analysis/GSE112055/salmon_out/quant_out/SRR6868519]$ wc -l quant.sf
137935 quant.sf

As for R script,

~/RNA-seq/Analysis/GSE112055/BANDITS]$ cat salmon_BANDITS_GSE112055.sh
library(BANDITS)
library(tximport)
library(rhdf5)

suppressMessages(library(Biostrings))
data_dir = "/home2/ymwang/linqin/RNA-seq/Analysis/GSE112055/salmon_out"
fasta = readDNAStringSet(file.path("/home2/ymwang/linqin/RNA-seq/fasta/release99", "Mus_musculus.GRCm38.rna.fa"))
ss = strsplit(names(fasta), " ")
gene_tr_id_fasta = data.frame(gene_id = gsub("gene:", "", sapply(ss, .subset, 4)),
                              transcript_id = sapply(ss, .subset, 1))
print("1 OK!")

# remove eventual NA's
gene_tr_id_fasta = gene_tr_id_fasta[ rowSums( is.na(gene_tr_id_fasta)) == 0, ]

# remove eventual duplicated rows:
gene_tr_id_fasta = unique(gene_tr_id_fasta)

print("2 OK!")

#Specify the directory of the transcript level estimated counts.
sample_names = c("SRR6868519",paste0("SRR686852", 0:8))
quant_files = file.path("/home2/ymwang/linqin/RNA-seq/Analysis/GSE112055/salmon_out/quant_out", sample_names, "quant.sf")
file.exists(quant_files)

print("3 OK!")

#Load the transcript level estimated counts via tximport.
library(tximport)
txi = tximport(files = quant_files, type = "salmon", txOut = TRUE)
counts = txi$counts
samples_design = data.frame(sample_id = sample_names,group = c("sham", "sham","sham", "sham", "sham", "TAC", "TAC","TAC", "TAC","TAC"))
eff_len = eff_len_compute(x_eff_len = txi$length)

print("4 OK!")

#salmon input(transcripts_to_keep = transcripts_to_keep)
equiv_classes_files = file.path(data_dir, "quant_out", sample_names,
                                "aux_info", "eq_classes.txt.gz")
file.exists(equiv_classes_files)
input_data = create_data(salmon_or_kallisto = "salmon",
                         gene_to_transcript = gene_tr_id_fasta,
                         salmon_path_to_eq_classes = equiv_classes_files,
                         eff_len = eff_len,
                         n_cores = 24)
print("5 OK!")

#Test for DTU
set.seed(61217)
results = test_DTU(BANDITS_data = input_data,
                   samples_design = samples_design,
                   group_col_name = "group",
                   R = 10^4, burn_in = 2*10^3, n_cores = 24,
                   gene_to_transcript = gene_tr_id_fasta)

print("All OK!)

~/RNA-seq/Analysis/GSE112055/BANDITS/R]$ nohup Rscript --save ../salmon_BANDITS_GSE112055.sh &
~/RNA-seq/Analysis/GSE112055/BANDITS/R]$ cat nohup.out
[1] "1 OK!"
[1] "2 OK!"
 [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
[1] "3 OK!"
reading in files with read_tsv
1 2 3 4 5 6 7 8 9 10
[1] "4 OK!"
 [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
Data has been loaded

~/RNA-seq/Analysis/GSE112055/BANDITS/R]$

~/RNA-seq/Analysis/GSE112055/BANDITS/R]$ top | head

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
32010 ymwang    20   0 9336484   7.1g  13304 R 100.0  5.7 510:06.55 R

I think parallel computation doesn't work, although I don't know about the parallel cores in R. In fact, I have run with "create_data" step at least 8h, and still not past. Is anything wrong with my code?

SimoneTiberi commented 4 years ago

Hi @Ci-TJ , thanks for your message and for your interest in BANDITS Similar answer to the one I wrote above for @tkzeng: create_data script is in R (while test_DTU runs in C++) and can be very slow if reads are compatible with transcripts from different genes; this can eventually connect thousands of genes whose input is modelled together, which can be slow.

This issue happens more frequently when mapping reads with pseudo-aligners (e.g., salmon/kallisto); that's why we suggest to align reads to the genome first (e.g., with STAR) and use salmon in alignment-based mode to compute equivalence classes (see the README for a short tutorial: https://github.com/SimoneTiberi/BANDITS/blob/master/README.md ). We also noticed that aligning reads to the genome, by decreasing the number of ambiguous reads, increases the accuracy of downstream differential splicing detections (as well as speed of differential methods).

Also, only some tasks of create_data are run in parallel (others cannot be parallelized), so it's normal if you see it running on 1 core.

If you can (and would like to) share the data, I'd like to investigate internally what gets so slow when running create_data.

Have a good one, Simone

Ci-TJ commented 4 years ago

Hi, @SimoneTiberi Thanks for your reply quickly. I doubt that I have set multiple cores, but still run on 1 core. That means _createdata mostly runs with 1 core if reserchers just use salmon/kallisto to quantify. If so, I will try to use STAR for aligment before salmon quantification. Best, Ci

SimoneTiberi commented 4 years ago

Hi again Ci, Yeah aligning reads with STAR is definitely a good idea (when the genome is available).

The main issue with create_data is not the parallel structure, but the fact that, in some datasets, there can be a bottleneck when reads are highly ambiguous between genes (i.e., when they are compatible with transcripts from different genes). It's the second time this issue comes up, so I'll work on this in the future.

Let me know if there are further problems.

Have a good one, Simone

BravoScarleth commented 2 years ago

Hi Simone,

I have the same problem. When I use create_data () it throws me this below message and the data is not loading:

All transcript names in 'names (eff_len)' must be in 'gene_to_transcript [, 2]'

But, my gene_tr_id_gtf file have the transcript row in [, 2], and my count file also has id transcripts

head (equiv_classes_files) [1] "~ / BANDITS / extdata / salmon / sample1 / aux_info / eq_classes.txt" [2] "~ / BANDITS / extdata / salmon / sample2 / aux_info / eq_classes.txt" [3] "~ / BANDITS / extdata / salmon / sample3 / aux_info / eq_classes.txt" [4] "~ / BANDITS / extdata / salmon / sample4 / aux_info / eq_classes.txt" [5] "~ / BANDITS / extdata / salmon / sample5 / aux_info / eq_classes.txt" [6] "~ / BANDITS / extdata / salmon / sample6 / aux_info / eq_classes.txt" head (gene_tr_id_gtf) gene_id transcript_id 1 1433b NM_001141615.1 2 1433z NM_001141343.1 3 143b2 NM_001140200.1 4 143g1 NM_001173658.1 5 143g2 NM_001140016.1 6 2a5e XM_014210585.1 head (beads) [, 1] [, 2] [, 3] [, 4] [, 5] [, 6] [, 7] [, 8] [, 9] [, 10] XM_014160784.1 8 4 6 3 1 3 4 1 0 0 XM_014209195.1 81 37 32 48 23 22 12 18 22 36 XM_014198158.1 3 2 2 2 1 3 1 1 1 0 XM_014126719.1 0 0 0 0 0 0 0 0 0 0 XM_014138059.1 0 0 0 1 1 0 0 0 2 1 XM_014149530.1 0 0 0 0 0 0 0 0 0 0 head (eff_len) XM_014160784.1 XM_014209195.1 XM_014198158.1 XM_014126719.1 XM_014138059.1 1149.4455 3939.4455 974.4455 308.4510 1469.4455 XM_014149530.1 476.4455

Can you see that some genes have a count of zero? This is because neither can get the transcripts_to_keep file either, when I used filter_transcripts () I got the same message:

All transcript names in 'rownames (transcript_counts)' must be in 'gene_to_transcript [, 2]'

This is my command line history:

library(BANDITS) dataDir = system.file("extdata", package = "BANDITS") data("gene_tr_id_gtf", package = "BANDITS") Warning message: In data("gene_tr_id_gtf", package = "BANDITS") : data set ‘gene_tr_id_gtf’ not found head(gene_tr_id_gtf) gene_id transcript_id 1 1433b NM_001141615.1 2 1433z NM_001141343.1 3 143b2 NM_001140200.1 4 143g1 NM_001173658.1 5 143g2 NM_001140016.1 6 2a5e XM_014210585.1 sample_names = paste0("sample", seq_len(10)) quant_files = file.path(dataDir, "salmon", sample_names, "quant.sf") file.exists(quant_files) [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE library(tximport) txi = tximport(files = quant_files, type = "salmon", txOut = TRUE) reading in files with read_tsv 1 2 3 4 5 6 7 8 9 10 counts = txi$counts head(counts) [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] XM_014160784.1 8 4 6 3 1 3 4 1 0 0 XM_014209195.1 81 37 32 48 23 22 12 18 22 36 XM_014198158.1 3 2 2 2 1 3 1 1 1 0 XM_014126719.1 0 0 0 0 0 0 0 0 0 0 XM_014138059.1 0 0 0 1 1 0 0 0 2 1 XM_014149530.1 0 0 0 0 0 0 0 0 0 0 samples_design = data.frame(sample_id = sample_names, group = c("A","A","A","A","A","B","B", "B", "B", "B")) samples_design sample_id group 1 sample1 A 2 sample2 A 3 sample3 A 4 sample4 A 5 sample5 A 6 sample6 B 7 sample7 B 8 sample8 B 9 sample9 B 10 sample10 B levels(samples_design$group) NULL eff_len = eff_len_compute(x_eff_len = txi$length) head(eff_len) XM_014160784.1 XM_014209195.1 XM_014198158.1 XM_014126719.1 XM_014138059.1 1149.4455 3939.4455 974.4455 308.4510 1469.4455 XM_014149530.1 476.4455 head(counts) counts head(counts) [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] XM_014160784.1 8 4 6 3 1 3 4 1 0 0 XM_014209195.1 81 37 32 48 23 22 12 18 22 36 XM_014198158.1 3 2 2 2 1 3 1 1 1 0 XM_014126719.1 0 0 0 0 0 0 0 0 0 0 XM_014138059.1 0 0 0 1 1 0 0 0 2 1 XM_014149530.1 0 0 0 0 0 0 0 0 0 0 transcripts_to_keep = filter_transcripts(gene_to_transcript = gene_tr_id_gtf, transcript_counts = counts, min_transcript_proportion = 0.01, min_transcript_counts = 10, min_gene_counts = 20) All transcript names in 'rownames(transcript_counts)' must be in 'gene_to_transcript[,2]' head(gene_tr_id_gtf) gene_id transcript_id 1 1433b NM_001141615.1 2 1433z NM_001141343.1 3 143b2 NM_001140200.1 4 143g1 NM_001173658.1 5 143g2 NM_001140016.1 6 2a5e XM_014210585.1 equiv_classes_files = file.path(dataDir, "salmon", sample_names, "aux_info", "eq_classes.txt") file.exists(equiv_classes_files) [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE equiv_classes_files [1] "~/BANDITS/extdata/salmon/sample1/aux_info/eq_classes.txt" [2] "~/BANDITS/extdata/salmon/sample2/aux_info/eq_classes.txt" [3] "~/BANDITS/extdata/salmon/sample3/aux_info/eq_classes.txt" [4] "~/BANDITS/extdata/salmon/sample4/aux_info/eq_classes.txt" [5] "~/BANDITS/extdata/salmon/sample5/aux_info/eq_classes.txt" [6] "~/BANDITS/extdata/salmon/sample6/aux_info/eq_classes.txt" [7] "~/BANDITS/extdata/salmon/sample7/aux_info/eq_classes.txt" [8] "~/BANDITS/extdata/salmon/sample8/aux_info/eq_classes.txt" [9] "~/BANDITS/extdata/salmon/sample9/aux_info/eq_classes.txt" [10] "~/BANDITS/extdata/salmon/sample10/aux_info/eq_classes.txt" samples_design$sample_id [1] "sample1" "sample2" "sample3" "sample4" "sample5" "sample6" [7] "sample7" "sample8" "sample9" "sample10" input_data = create_data(salmon_or_kallisto = "salmon", gene_to_transcript = gene_tr_id_gtf, salmon_path_to_eq_classes = equiv_classes_files, eff_len = eff_len, n_cores = 20) All transcript names in 'names(eff_len)' must be in 'gene_to_transcript[,2]' head(gene_tr_id_gtf) gene_id transcript_id 1 1433b NM_001141615.1 2 1433z NM_001141343.1 3 143b2 NM_001140200.1 4 143g1 NM_001173658.1 5 143g2 NM_001140016.1 6 2a5e XM_014210585.1 head(counts) [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] XM_014160784.1 8 4 6 3 1 3 4 1 0 0 XM_014209195.1 81 37 32 48 23 22 12 18 22 36 XM_014198158.1 3 2 2 2 1 3 1 1 1 0 XM_014126719.1 0 0 0 0 0 0 0 0 0 0 XM_014138059.1 0 0 0 1 1 0 0 0 2 1 XM_014149530.1 0 0 0 0 0 0 0 0 0 0 head(eff_len) XM_014160784.1 XM_014209195.1 XM_014198158.1 XM_014126719.1 XM_014138059.1 1149.4455 3939.4455 974.4455 308.4510 1469.4455 XM_014149530.1 476.4455

I would appreciate the help.

Best, Scarleth.

SimoneTiberi commented 2 years ago

Hi @BravoScarleth, All transcript names in 'rownames(transcript_counts)' must be in 'gene_to_transcript[,2]' this error message, literally means that not all transcripts names that are rownames(transcript_counts) were found in gene_to_transcript [, 2]. So there are 1 or more transcript names in counts that doesn't have a match in gene_to_transcript [, 2].

Can you check if: all( rownames(counts) %in% gene_to_transcript [, 2] ) ? and also how many have a match: mean( rownames(counts) %in% gene_to_transcript [, 2] ) ?

You'd ensure that the gene_tr_id_gtf corresponds to the gene_transcript names that were used to align the reads. See Section 3 here: https://bioconductor.org/packages/devel/bioc/vignettes/BANDITS/inst/doc/BANDITS.html

Additionally (but unrelated), I see:

data("gene_tr_id_gtf", package = "BANDITS")
Warning message:
In data("gene_tr_id_gtf", package = "BANDITS") :
data set ‘gene_tr_id_gtf’ not found

The gene_tr_id object in BANDITS is called gene_tr_id, but really you shouldn't use that one anyway: that's just the file valid for the example we have in the vignettes and documentation.

Let me know if you manage to get it working; again, I think gene_tr_id_gtf is computed on a different gtf compared to the one used to align the RNA-seq reads.

BravoScarleth commented 2 years ago

Hi Simone,

Thanks for the prompt response. I followed your proposal and this happened:

all (rownames (counts)% in% gene_tr_id_gtf [, 2]) [1] FALSE

I was reviewing both files and there were several missing transcripts in the gene_tr_id_gtf file (this one was generated by me, it is different from the one you use in the example). I blame the GFF file I worked with, as it has very poorly annotated transcripts. Do you think it might be possible to edit the counts file (removing these badly annotated transcripts) and then import this edited file so that data loading can work? Regarding how many have a match? This was the answer:

mean (rownames (counts)% in% gene_tr_id_gtf [, 2]) [1] 0.9979539

Finally, version 3 of the genome I'm working with has just come out. I will have to do it all over again with the v3 files. If I have any problem, obviously, I'll let you know. Thank you very much for your help.

Best, Scarleth

SimoneTiberi commented 2 years ago

Hey Scarleth, ok, we found the error.

Well, in theory you'd use the same GFF file that you used to do the alignment. Clearly, most transcripts (0.9979539) are there, so the issue only involves a greatest minority of transcripts.

Removing those transcripts from the counts seems reasonable; if I were you, I'd just check that those transcripts have no or few counts, before removing them.

Let me know if you have further issues with BANDITS.

Have a good one, Simone

BravoScarleth commented 2 years ago

hi again simone

Yes, you got everything right. I've been going through the counts of those poorly annotated transcripts and thankfully they have very few counts. With this, I proceed to filter my file. Thank you very much, Simone. You are very kind.

That you're very well, Scarleth

BravoScarleth commented 2 years ago

Hi Simone,

I was finally able to run Bandits with my 10 samples. The data loaded perfectly. The problem is that it took more than 6 hours just for the step of estimating the maximum number of transcripts per group. I gave it 50 processors to do the task and it only uses one processor, and it's still not done. I really don't understand why it doesn't use all the processors I gave it. In the attached photo you can see that the program only uses one processor.

SimoneTiberi commented 2 years ago

Hey Scarleth, what function exactly? Actually, I don't think there is a function to estimate the maximum number of transcripts per group. Maybe you mean prior_precision?

If you can you send the full pipeline you are running (including printed messages from R), I can see if I can spot anything wrong.

The 2 most computationally intensive functions are: 1 prior_precision -> it actually runs DRIMSeq to estimate the precision parameter (used as a prior in test_DTU); 2 test_DTU -> full MCMC on all genes...so it takes awhile.

I noticed that 1 sometimes takes a long time: to speed up calculations, you can use a subset of the genes from counts. Use the code below, just before prior_precision (then set transcript_counts = transcript_counts computed below).

  counts = transcript_counts
  # select only transcripts_to_keep
  sel = rownames(transcript_counts) %in% transcripts_to_keep
  transcript_counts = transcript_counts[sel, ]

 # get the gene id for every row id in transcript_counts
  N = ncol(transcript_counts)
  matches = match(rownames(transcript_counts), gene_to_transcript[, 2])
  gene_id = as.character(gene_to_transcript[matches, 1])

  # get and counts actual genes (unique)
  genes = unique(gene_id)
  n_genes_tot = length(genes)

  # randomly select 100 genes only:
  sel_genes = sample.int(n_genes_tot, 100)
  transcript_counts = transcript_counts[ gene_id %in% genes[sel_genes], ]

  # compute the precision on these 100 genes only:
  set.seed(61217)
  precision = prior_precision(gene_to_transcript = gene_tr_id,
                            transcript_counts = transcript_counts,
                            n_cores = 50,
                            transcripts_to_keep = transcripts_to_keep)

Let me know if this helps.

NOTE: in sel_genes = sample.int(n_genes_tot, 100) I select 100 genes only. If the computational cost is reasonable, you can also increase that a bit to a few hundreds or 1,000. Clearly, the higher this number the more accurate the answer, but at the same time, we only need prior_precision to estimate 2 numbers (mean and sd of the precision), so a few hundred genes should be enough.

My bad I didn't put this filtering step in the original release (will do asap); to be honest I only recently thought about it.

BravoScarleth commented 2 years ago

Hi Simone

Thanks for your reply, it is very strategic. However, my problem is not directly related with that prior _precision/test DTU functions. In fact, my problem is previous, in the “create_data” function.

input_data = create_data(salmon_or_kallisto = "salmon", gene_to_transcript = gene_tr_id_gtf, salmon_path_to_eq_classes = equiv_classes_files, eff_len = eff_len, n_cores = 20, transcripts_to_keep = transcripts_to_keep) Data has been loaded

In this function there is the option of “n_cores”, my problem is that I put “50 cores”, but you can see from my previous image that the program is just using “1 core”. So, I assume that there is some error with the parallel function. In resume, after run 24 hrs, with just “1 core” the “create_data” function I just get “data has been loaded”. So, do you have some advices for Bandits may effectively use several cores.

Regards, Scarleth

SimoneTiberi commented 2 years ago

Oh I see; I guess that reads were aligned with pseudo-aligners (salmon/kallisto)?

Yeah still look at the initial answers I wrote above in this "issue"; they are still valid in this case: in create_data parallel tasks are limited (btw at most n_samples), most tasks have to run 1 after the other.

BravoScarleth commented 2 years ago

Sure enough, the reads were aligned with Salmon. OK, I'll check those answers. Thanks again.

SimoneTiberi commented 2 years ago

Ok, I though so; yeah check the 3 replies above your comment, they basically contain the answers.

Let me know if you still have doubts.

BravoScarleth commented 2 years ago

Hi Simone,

I finished running test_DTU with my 10 biological replicates fine. But, now I'm running the test with all my samples, that's 42 samples, in 2 conditions 21 x 21, and again I'm having problems with the create_data function. I got these messages.

input_data = create_data(salmon_or_kallisto = "salmon", gene_to_transcript = gene_tr_id_gtf, salmon_path_to_eq_classes = equiv_classes_files, eff_len = eff_len, n_cores = 50, transcripts_to_keep = transcripts_to_keep) Data has been loaded One group of genes has 39385 genes Splitting the group in 39385 groups of individual genes

I don't know what the last message means, because I haven't sent it before. On the other hand, I expected to get something like this:

Max 69 transcripts per group

And did not do it. Do you know what happened?

SimoneTiberi commented 2 years ago

Hi Scarleth, BANDITS deals with counts mapping to multiple genes (it basically allocates them to the gene of origin). BANDITS creates groups of genes that share multi-mapping reads across genes. Now, dealing with such groups is computationally intensive, therefore when these groups are cumbersome (> 100 genes), we split them and treat them separately (a very small approximation).

When you align reads with Salmon/kallisto and have many samples, you can have a very large group (in your case 1 group has basically 40 k genes)...splitting such a large group takes time; you may have to wait for a while!

Btw, in the README, we suggest to align reads with STAR (and then provide alignments to Salmon); this: 1) avoids the issue above; 2) marginally improves performance.

Simone

BravoScarleth commented 2 years ago

Hi Simone, Thanks for your reply. Ok, I understand. Well, I'll take your advice and try to start from the alignment against the genome. Thanks again. Regards, Scarleth

SimoneTiberi commented 2 years ago

Just to clarify: the difference is not massive, but indeed results are a bit more accurate (and computational cost is reduced).

Simone

BravoScarleth commented 2 years ago

OK, thank

jeronimoleberle commented 1 year ago

Hi, Simone. I have 12 samples and ran BANDITS without any problem until I try to create_data , where it says :

Data has been loaded 0 genes to be analyzed

What can I do? I have the following :

input_data = create_data(salmon_or_kallisto = "salmon", gene_to_transcript = gene_tr_id, # nrow(gene_tr_id) == 274081 salmon_path_to_eq_classes = equiv_classes_files, #12 files, each for a sample eff_len = eff_len, #length(eff_len)==234438 n_cores = 2, #my PC has only 8 cores, although it says it's highly recommended to use one core for each sample
transcripts_to_keep = transcripts_to_keep)

I was expecting to have more genes analyzed (36.94% of transcripts were kept after the filtering step)

SimoneTiberi commented 1 year ago

Hey @jeronimoleberle, I think there is probably a mismatch in the id/name of genes or transcripts.

Can you check if the gene_tr_id has the same ids as transcripts_to_keep? sum(transcripts_to_keep %in% gene_tr_id$transcript_id)

If you can't spot the error, can you share a minimal example and share it with me (you can email me: simone.tiberi@unibo.it).

Thanks, Simone

SimoneTiberi / BANDITS

Stalls at create_data step #1

Max 69 transcripts per group