Open pramodsshinde opened 1 year ago
Just a quick clarification: we will have two datasets, which you denoted D1 and D2, right? If I rewrite your steps, it would be:
@joreynajr that's correct. We will have two harmonized datasets, D1 and D2. D2 is more for internal testing to see the extent of overlap between 2021 and 2022 datasets and to study whether the longitudinal 2021 dataset on its own can predict 2022 responses.
Thanks for the clarification!
I'm working on a new branch and have started restructuring the repo: https://github.com/CMI-PB/cmi-pb-multiomics/tree/re-harmonize
I just have a few questions/suggestions:
2020_2021_specimen.csv and 2020_2021_subject.csv should be split up between
These seem like nit-picky things, but I really think they should be resolved for the sake of anyone using this dataset.
I did some manual parsing and data-wrangling to get the following files:
This is what should be available for download from: https://www.cmi-pb.org/downloads/cmipb_challenge_datasets/2nd_cmipb_challenge/10202022/
I found another tricky issue: the following files are actually saved with tabs as delimiters, but the extension says ".csv":
2020_abtiters.csv, 2020_live_cell_percentage.csv, 2020_olink.csv, 2020_rnaseq.csv
2021_abtiter.csv, 2021_live_cell_percentage.csv, 2021_olink.csv, 2021_rnaseq.csv
I've manually changed this on my end, but it definitely needs to be fixed.
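For anyone hitting the same mismatch before the files are fixed, here is a minimal sketch of reading a file by its actual delimiter rather than trusting the extension. It uses only the Python standard library; `guess_delimiter` and `read_table` are hypothetical helper names, and the heuristic (compare tab and comma counts in the header line) is an assumption that should hold for these files but is not a general-purpose detector.

```python
import csv
import io


def guess_delimiter(header_line: str) -> str:
    """Return a tab if the header has more tabs than commas, else a comma."""
    return "\t" if header_line.count("\t") > header_line.count(",") else ","


def read_table(text: str) -> list[dict[str, str]]:
    """Parse delimited text into a list of row dicts, keyed by header name."""
    lines = text.splitlines()
    header = lines[0] if lines else ""
    reader = csv.DictReader(io.StringIO(text), delimiter=guess_delimiter(header))
    return list(reader)
```

To use it on a downloaded file, read the file's contents into a string (for example with `open(path, newline="").read()`) and pass that string to `read_table`.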
I also ran into another issue: for olink, some proteins are given using two IDs. Here are some examples of that happening:
- Q29983,Q29980
- P29459,P29460
- Q14213,P29459
I'm going to handle this by adding a "-" between multiple names and creating a new file.
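Roughly, the joining step could look like the sketch below. `join_ids` is a hypothetical helper name, and it assumes the multiple IDs arrive as a single comma-separated string; the separator is a parameter so it can be swapped later.

```python
def join_ids(field: str, sep: str = "-") -> str:
    """Collapse a comma-separated list of IDs into one separator-joined token."""
    # Drop surrounding whitespace and any empty pieces left by stray commas.
    parts = [part.strip() for part in field.split(",") if part.strip()]
    return sep.join(parts)
```

With the default separator, `join_ids("Q29983,Q29980")` gives `"Q29983-Q29980"`, and a field with a single ID is returned unchanged.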
Pramod: Perfect!!
I actually noticed that the 2022 olink file, "2022BD_olink_protexp.csv", uses "" so I'll switch over to this for multiple names
Another issue: the 2020/2021 olink files don't have a header, but 2022 does. The files used to have this header but now don't?
Actually, the problem is not so much that there isn't a header but that the header changed. 2020 uses:
specimen_id,olink_id,uniprot_id,prot_exp,unit,lloq,uloq,QC
2021 uses:
specimen_id,olink_id,uniprot_id,prot_exp,unit,lloq,uloq,QC
and 2022 uses:
uniprot_id,specimen_id,lower_limit_of_quantitation,protein_expression,quality_control,unit,upper_limit_of_quantitation
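One way to line these up is to rename the 2022 columns to the 2020/2021 names and reorder them, sketched below. The mapping is my assumption based only on the column names (for example, `lower_limit_of_quantitation` to `lloq`), and `RENAME_2022`, `TARGET_COLUMNS`, and `harmonize_row` are hypothetical names. The 2022 header has no `olink_id`, so this sketch fills it with an empty string.

```python
# Assumed correspondence between 2022 column names and the 2020/2021 names.
RENAME_2022 = {
    "lower_limit_of_quantitation": "lloq",
    "upper_limit_of_quantitation": "uloq",
    "protein_expression": "prot_exp",
    "quality_control": "QC",
}

# Column order used by the 2020 and 2021 files.
TARGET_COLUMNS = [
    "specimen_id", "olink_id", "uniprot_id", "prot_exp",
    "unit", "lloq", "uloq", "QC",
]


def harmonize_row(row: dict[str, str]) -> dict[str, str]:
    """Rename 2022 keys to the 2020/2021 names and reorder the columns."""
    renamed = {RENAME_2022.get(key, key): value for key, value in row.items()}
    # Columns missing from the input (e.g. olink_id in 2022) become "".
    return {col: renamed.get(col, "") for col in TARGET_COLUMNS}
```

Rows read with `csv.DictReader` can be passed through `harmonize_row` one at a time and written back out with `csv.DictWriter(..., fieldnames=TARGET_COLUMNS)`.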
Hi @joreynajr
Thanks for noting down all the issues. I worked through them and modified the files. The updated files can be accessed here. Please let me know if you notice any additional issues.
I also ran into another issue: for olink, some proteins are given using two IDs. Here are some examples of that happening:
- Q29983,Q29980
- P29459,P29460
- Q14213,P29459
These are protein complexes, which is why two UniProt IDs are associated with each of them. We are working on creating ontology terms for them, but it will take some time (for the 3rd challenge).
I found another tricky issue: the following files are actually saved with tabs as delimiters, but the extension says ".csv":
2020_abtiters.csv, 2020_live_cell_percentage.csv, 2020_olink.csv, 2020_rnaseq.csv
2021_abtiter.csv, 2021_live_cell_percentage.csv, 2021_olink.csv, 2021_rnaseq.csv
I've manually changed this on my end, but it definitely needs to be fixed.
This problem is fixed now. I changed file extensions to tsv and added \t as the delimiter.
I still see the same thing on my end when I use this link: https://www.cmi-pb.org/downloads/cmipb_challenge_datasets/2nd_cmipb_challenge/10202022/
Sorry for the confusion. I created a new file dir here: https://www.cmi-pb.org/downloads/cmipb_challenge_datasets/2nd_cmipb_challenge/11282022/
Thanks! I'll start taking a look
Raw datasets are available here