MaayanLab / archs4

ARCHS4 RNA-seq processing scripts and web server pages.
Other
54 stars 10 forks source link

elysium #30

Closed malonzm1 closed 1 year ago

lachmann12 commented 1 year ago

Hi, yes, Elysium is still working. Assuming since the ticket was closed the issue was resolved. I also want to point out that the same pipeline can be accessed with Biojupies: https://maayanlab.cloud/biojupies/upload/reads

malonzm1 commented 1 year ago

Thanks.

malonzm1 commented 1 year ago

Hi,

Does elysium have a command line version?

lachmann12 commented 1 year ago

there is no command line option. However if you want i can help you run the same pipeline locally on your own computer. I would assume this would be easier than uploading your files to a server anyway. For this you need the ARCHS4 index file and the gene mapping information. The rest would be fairly straight forward.

The kallisto index file is located at (human, mouse): https://s3.amazonaws.com/mssm-seq-index/"+organism+"_index.idx

Then run kallisto with this command: index = path of kallisto index file

For single file kallisto quant -t 2 -i "+index+" --single -l 200 -s 20 -o yourfastq.fastq

For paired: kallisto quant -t 2 -i "+index+" -o yourfastq1.fastq yourfastq2.fastq

You will then get an abundance.tsv file with the count data at the transcript level. For gene level they need to be mapped to genes. For this you need to use R. There is a mapping file:

https://s3.amazonaws.com/mssm-seq-genemapping/"+organism+"_mapping.rda

genelevel.r script

library("plyr")

args <- commandArgs(TRUE)

mapping = args[1]
res = load(mapping)

f = "pathto/abundance.tsv"

abu = read.table(f, sep="\t", stringsAsFactors=F)
ugene = cb[,2]

m3 = match(abu[,1], cb[,1])
cco = cbind(abu,ugene[m3])[-1,]
co = cco[,c(6,4)]
co[,1] = as.character(co[,1])
df = data.frame(co[,1], as.numeric(co[,2]))
colnames(df) = c("gene", "value")

dd = ddply(df,.(gene),summarize,sum=sum(value),number=length(gene))
ge = dd[,2]
names(ge) = dd[,1]

write.table(ge, file="gene_abundance.tsv", quote=F, col.names=F, sep="\t")

Run: Rscript --vanilla genelevel.r pathtomappingfile.rda

This should give you a gene_abundance.tsv which is identical to what you would get from using the Elysium pipeline.

Hope this helps.

malonzm1 commented 1 year ago

Many thanks! Just wondering, no read preprocessing (e.g. trimming)?

lachmann12 commented 1 year ago

Read trimming when quantifying gene expression iny opinion is not a good idea. From what I can tell it leads to false negatives without really preventing false positives. The alignment algorithms so not care about misaligned prefixes and suffixes due to the nature of their implementation as far as I know.

malonzm1 commented 1 year ago

Thanks!

malonzm1 commented 1 year ago

Hi,

In the script you provided (genelevel.r), where does the variable cb come from? And where is the variable res used?

malonzm1 commented 1 year ago

Hi,

After generating "gene_abundance.tsv", how can I calculate CPM and TPM?

Thanks.