d3b-center / d3b-bixu-data-assembly

Apache License 2.0

Collapse and combine GTEx TPM and expected_count to KFDRC collapsed RSEM files #7

Closed kgaonkar6 closed 2 years ago

kgaonkar6 commented 3 years ago

Background:

For OT we will need each data release to include processed GTEx v8 data in gene-expression-rsem-tpm-collapsed.rds and gene-counts-rsem-expected_count-collapsed.rds.

In the last data release I did:

Required update

Update the collapse-rnaseq to be able to handle adding GTEx processed files.

CC @zhangb1 @yuankunzhu @jharenza for discussion.

jharenza commented 3 years ago

@zhangb1 do you have an ETA for this ticket?

zhangb1 commented 3 years ago

@jharenza can someone from your team take care of the ticket, maybe Run?

I really need to focus on the data assembly CWL development this week.

jharenza commented 3 years ago

I don't think she has EC2 access yet, and we have a lot of open issues for OT that need to be done as well. @yuankunzhu can someone else help with these upstream processes?

zhangb1 commented 3 years ago

I tried to run it locally but hit a memory issue, so it can't be processed on my machine.

I need to wrap it in CWL and try running it in a Cavatica project.

The error:

➜  GTEx Rscript 00-collapse_matrices.R -i gene-expression-rsem-tpm.rds -g gencode.v27.primary_assembly.annotation.gtf.gz -m gene-expression-rsem-tpm-collapsed.rds -t gene-expression-rsem-tpm-collapsed_table.rds -n GTEx_Analysis_2017-06-05_v8_RNASeQCv1.1.9_gene_tpm.gct.gz
[1] "Generating input matrix and drops table...!"
[1] "Read merged GTEx data"

── Column specification ─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
cols(
  .default = col_double(),
  Name = col_character(),
  Description = col_character()
)
ℹ Use `spec()` for the full column specifications.

[1]    95949 killed     Rscript 00-collapse_matrices.R -i gene-expression-rsem-tpm.rds -g  -m  -t  -n

kgaonkar6 commented 3 years ago

I had to run it on a large EC2 instance; the GTEx files are huge.
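Note: the `killed` line in the log above is consistent with the memory issue described. A GCT v1.2 file records the matrix dimensions on its second line, so a rough size estimate is possible before loading anything. The following is a minimal Python sketch, assuming the `-n` input follows the GCT v1.2 layout (version line, then a line with row and column counts); the helper names `gct_dimensions` and `dense_size_gib` are hypothetical and are not part of collapse-rnaseq.

```python
import gzip


def gct_dimensions(path):
    """Return (n_rows, n_cols) read from the second line of a GCT v1.x file."""
    # Assumption: gzip-compressed files end in ".gz"; anything else is plain text.
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "rt") as handle:
        version = handle.readline().strip()  # first line, e.g. "#1.2"
        if not version.startswith("#1."):
            raise ValueError(f"unexpected GCT version line: {version!r}")
        # Second line holds the number of rows and the number of sample columns.
        n_rows, n_cols = (int(x) for x in handle.readline().split()[:2])
    return n_rows, n_cols


def dense_size_gib(n_rows, n_cols, bytes_per_value=8):
    """Approximate size in GiB of a dense matrix of 8-byte numeric values."""
    return n_rows * n_cols * bytes_per_value / 2**30
```

Passing the result of `gct_dimensions` to `dense_size_gib` gives only a lower bound: the estimate covers a single dense copy of the matrix, and any intermediate copies made while reshaping or merging add to it.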

zhangb1 commented 3 years ago

I generated the GTEx collapsed data in a Cavatica project here: https://cavatica.sbgenomics.com/u/zhangb1/test-download/tasks/2d842fee-65cf-4c5b-b1aa-0e458b2dfa29/

and I generated pbta-gmkf-gene-expression-rsem-tpm-collapsed.rds locally.

I then tried to merge the files using the notebook:

> common_genes <- intersect(rownames(gtex),rownames(pbta_gmkf))
> length(common_genes)
[1] 0

The length of common_genes is 0.

@kgaonkar6 can you take a look?
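Note: the thread does not record what caused the empty intersection. Two common causes of this symptom are matrices keyed by different identifier types (for example versioned Ensembl IDs on one side and unversioned IDs or gene symbols on the other) and a matrix whose row names were never set, leaving plain row indices. The sketch below, in Python rather than R, shows one way to check for both once the row names have been exported as lists of strings. The function names and the example identifiers are illustrative assumptions, not values from the actual files.

```python
from collections import Counter


def id_kind(identifier):
    """Classify an identifier as a bare row index, an Ensembl gene ID, or other."""
    if identifier.isdigit():
        return "row_index"
    if identifier.startswith("ENSG"):
        return "ensembl"
    return "other"


def strip_ensembl_version(identifier):
    """Drop a trailing version from an Ensembl ID: 'ENSG0001.5' -> 'ENSG0001'."""
    if identifier.startswith("ENSG") and "." in identifier:
        return identifier.split(".", 1)[0]
    return identifier


def overlap_report(ids_a, ids_b):
    """Count shared identifiers as-is and after two simple normalizations."""
    normalizers = {
        "as_is": lambda s: s,
        "stripped_whitespace": str.strip,
        "without_ensembl_version": lambda s: strip_ensembl_version(s.strip()),
    }
    return {
        name: len({fn(a) for a in ids_a} & {fn(b) for b in ids_b})
        for name, fn in normalizers.items()
    }


# Illustrative identifiers only: one side versioned, the other not.
gtex_ids = ["ENSG00000000001.5", "ENSG00000000002.3"]
pbta_ids = ["ENSG00000000001", "ENSG00000000002", "ENSG00000000003"]

print(Counter(id_kind(i) for i in gtex_ids))
print(overlap_report(gtex_ids, pbta_ids))
```

If `as_is` is zero but a normalized count is not, the identifiers differ only in form. If every count is zero and `id_kind` reports different kinds on the two sides, the matrices are keyed by different identifier types and need a mapping step before merging.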

zhangb1 commented 3 years ago

Wait, I think I found the issue. I will check.

zhangb1 commented 3 years ago

@jharenza @kgaonkar6

The merged files have been updated in the bucket:

2021-07-21 14:54:03 1073668357 gene-counts-rsem-expected_count-collapsed.rds
2021-07-21 14:28:30 1611312452 gene-expression-rsem-tpm-collapsed.rds

The md5 checksums have also been updated.

kgaonkar6 commented 3 years ago

Thanks a lot @zhangb1 !!

@runjin326 can you try to download these files with the updated download-data.sh pointing to the v7 S3 bucket, since you need them for the fusion_filtering rerun?

runjin326 commented 3 years ago

@kgaonkar6, @zhangb1, thanks so much! Yes I have downloaded the data and should be able to re-run fusion_filtering now.

runjin326 commented 3 years ago

Oh sorry, I forgot to mention, @kgaonkar6: the md5sum did not match. I am guessing that is because the files were updated?

kgaonkar6 commented 3 years ago

Could you remove gene-counts-rsem-expected_count-collapsed.rds and gene-expression-rsem-tpm-collapsed.rds from your local data folder and rerun download-data.sh?
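Note: the suggestion above implies that the local copies predated the updated files in the bucket. A quick way to find which local files no longer match a checksum manifest is sketched below in Python. It assumes a manifest in the usual `md5sum` output format (`<hash>  <filename>` per line) sitting alongside the data; the manifest name and the function names are hypothetical, and the behavior of download-data.sh itself is not described in this thread.

```python
import hashlib
from pathlib import Path


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the md5 hex digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def stale_files(manifest_path, data_dir):
    """List manifest entries whose local copy is missing or has a different md5."""
    stale = []
    for line in Path(manifest_path).read_text().splitlines():
        if not line.strip():
            continue  # skip blank lines
        expected, name = line.split(maxsplit=1)
        name = name.strip().lstrip("*")  # md5sum marks binary mode with '*'
        local = Path(data_dir) / name
        if not local.exists() or md5_of_file(local) != expected.lower():
            stale.append(name)
    return stale
```

Files returned by `stale_files` are the candidates to delete before rerunning the download. Reading in chunks keeps memory use flat, which matters here because the two collapsed files listed earlier in the thread are each over 1 GB.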

runjin326 commented 3 years ago

Yes. Now it's good! Thanks!

komalsrathi commented 2 years ago

Closing this as we have this data available in OT.