Creating harmonized datasets for 2nd prediction challenge

pramodsshinde commented 1 year ago

To identify feature overlap between the [2020 + 2021] dataset and the 2022 dataset (D1)
To identify feature overlap between the 2021 dataset and 2022 dataset (D2)
Perform dataset harmonizations for D1 and D2, separately

Raw datasets are available here

joreynajr commented 1 year ago

Just a quick clarification: we will have two datasets, as you denoted with D1 and D2, right? If I rewrite your steps, it would be:

Create a harmonized dataset called D1 which uses the feature overlap between 2020, 2021, and 2022.
Create a harmonized dataset called D2 which uses the feature overlap between 2021 and 2022.

pramodsshinde commented 1 year ago

@joreynajr that's correct. We will have two harmonized datasets, D1 and D2. D2 is more for internal testing to see the extent of overlap between 2021 and 2022 datasets and to study whether the longitudinal 2021 dataset on its own can predict 2022 responses.
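
A minimal sketch of the overlap step, assuming each dataset has already been reduced to a list of feature names. The `feature_overlap` helper and the gene names are hypothetical, for illustration only; they are not taken from the repo.

```python
def feature_overlap(*feature_sets):
    """Return the sorted features present in every input collection."""
    sets = [set(features) for features in feature_sets]
    if not sets:
        return []
    return sorted(set.intersection(*sets))


# Hypothetical feature names, for illustration only.
features_2020 = ["geneA", "geneB", "geneC"]
features_2021 = ["geneB", "geneC", "geneD"]
features_2022 = ["geneC", "geneD", "geneE"]

# D1: features shared by 2020, 2021, and 2022.
d1_features = feature_overlap(features_2020, features_2021, features_2022)
# D2: features shared by 2021 and 2022 only.
d2_features = feature_overlap(features_2021, features_2022)
```

Because D2 intersects fewer datasets, it can never contain fewer features than D1.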

joreynajr commented 1 year ago

Thanks for the clarification!

joreynajr commented 1 year ago

Working on a new branch and I started restructuring the repo: https://github.com/CMI-PB/cmi-pb-multiomics/tree/re-harmonize

joreynajr commented 1 year ago

I just have a few questions/suggestions:

2020_2021_specimen.csv and 2020_2021_subject.csv should be split up between
2022 files use different names than the 2020/2021 files, such as:
- "olink_prot_exp" versus "olink"
- "live_cell_percentages" versus "live_cell_percentage" (plural naming used for 2022)
- "ab_titer" versus "abtiter"
- remove "BD" from 2022 files

These seem like nit-picky things but I really think they should just be solved for the sake of everyone/anyone using this dataset.
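
One way the renaming could be scripted is sketched below. The assay-name mapping comes from the list above; the `harmonize_filename` helper, the `<year>BD_<assay>.<ext>` pattern, and the example file names are assumptions made for illustration.

```python
import re

# Assay names as listed above: 2022 spelling -> 2020/2021 spelling.
ASSAY_RENAMES = {
    "olink_prot_exp": "olink",
    "live_cell_percentages": "live_cell_percentage",
    "ab_titer": "abtiter",
}


def harmonize_filename(name):
    """Map a 2022-style file name onto the 2020/2021 naming scheme.

    Assumes names look like "<year>[BD]_<assay>.<ext>" (hypothetical pattern).
    """
    # Drop the "BD" tag that follows the year, e.g. "2022BD_" -> "2022_".
    name = re.sub(r"^(\d{4})BD_", r"\1_", name)
    year, _, rest = name.partition("_")
    stem, dot, ext = rest.rpartition(".")
    # Replace the assay name only when it is one of the known 2022 spellings.
    stem = ASSAY_RENAMES.get(stem, stem)
    return f"{year}_{stem}{dot}{ext}"


# Hypothetical file names, for illustration only.
renamed = [harmonize_filename(n) for n in ("2022BD_ab_titer.csv", "2021_olink.csv")]
```

Files that already follow the 2020/2021 convention pass through unchanged, so the helper can be applied to every file in a directory.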

joreynajr commented 1 year ago

I did some manual parsing and data-wrangling to get the following files:

This is what should be available for download from: https://www.cmi-pb.org/downloads/cmipb_challenge_datasets/2nd_cmipb_challenge/10202022/

joreynajr commented 1 year ago

I found another tricky issue, when downloading the following files they are actually saved using tabs as delimiters but the extension says ".csv":

2020_abtiters.csv, 2020_live_cell_percentage.csv, 2020_olink.csv, 2020_rnaseq.csv
2021_abtiter.csv, 2021_live_cell_percentage.csv, 2021_olink.csv, 2021_rnaseq.csv

I've manually changed this on my end but definitely needs to be fixed.
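
A quick check for this kind of mismatch could look like the sketch below. It only counts candidate delimiters in the header line, which is a simple heuristic and not a full CSV parser; the helper names and the sample header are hypothetical.

```python
import os

# Delimiter each extension is expected to imply.
EXPECTED_DELIMITER = {".csv": ",", ".tsv": "\t"}


def guess_delimiter(header_line, candidates=(",", "\t")):
    """Return the candidate delimiter that occurs most often in the header line."""
    return max(candidates, key=header_line.count)


def extension_matches_content(path, header_line):
    """True when the file extension agrees with the delimiter found in the header."""
    ext = os.path.splitext(path)[1].lower()
    return EXPECTED_DELIMITER.get(ext) == guess_delimiter(header_line)


# Hypothetical tab-delimited header, for illustration only.
header = "specimen_id\tcolumn_a\tcolumn_b"
mismatch = not extension_matches_content("2020_olink.csv", header)
```

In practice the header line would be read from the first line of each downloaded file.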

joreynajr commented 1 year ago

I also ran into another issue, for olink some proteins are given using two ids, here are some examples of that happening:

Q29983,Q29980
P29459,P29460
Q14213,P29459

joreynajr commented 1 year ago

I'm going to handle this by adding a "-" between multiple names and creating a new file.

Pramod: Perfect!!
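
A minimal sketch of that joining step, assuming the IDs arrive as one comma-separated string. `join_uniprot_ids` is a hypothetical helper, and the separator is a parameter so it can be swapped later.

```python
def join_uniprot_ids(value, sep="-"):
    """Collapse a comma-separated list of UniProt IDs into a single field."""
    ids = [part.strip() for part in value.split(",") if part.strip()]
    return sep.join(ids)


# Examples taken from the thread above.
joined = [join_uniprot_ids(v) for v in ("Q29983,Q29980", "P29459,P29460", "Q14213,P29459")]
```

Collapsing the pair into one field keeps the extra comma from being read as a column break in a comma-delimited file.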

joreynajr commented 1 year ago

I actually noticed that the 2022 olink file, "2022BD_olink_protexp.csv", uses "" so I'll switch over to this for multiple names

joreynajr commented 1 year ago

Another issue: the 2020/2021 olink files don't have a header but the 2022 file does. The files used to have this header but now don't?

joreynajr commented 1 year ago

Actually, the problem is not so much that there isn't a header but that the header changed. 2020 uses:

specimen_id,olink_id,uniprot_id,prot_exp,unit,lloq,uloq,QC

2021 uses:

specimen_id,olink_id,uniprot_id,prot_exp,unit,lloq,uloq,QC

and 2022 uses:

uniprot_id,specimen_id,lower_limit_of_quantitation,protein_expression,quality_control,unit,upper_limit_of_quantitation
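
The three headers could be brought onto one schema roughly as sketched below. The column correspondence (for example `protein_expression` to `prot_exp`) is an assumption inferred from the names and not confirmed in this thread, and the sample record is made up.

```python
import csv
import io

# Assumed correspondence between 2022 and 2020/2021 column names.
COLUMN_RENAMES = {
    "lower_limit_of_quantitation": "lloq",
    "upper_limit_of_quantitation": "uloq",
    "protein_expression": "prot_exp",
    "quality_control": "QC",
}
# Column order used by the 2020 and 2021 files.
TARGET_COLUMNS = ["specimen_id", "olink_id", "uniprot_id", "prot_exp",
                  "unit", "lloq", "uloq", "QC"]


def harmonize_olink_rows(rows):
    """Yield rows keyed by the 2020/2021 column names, in TARGET_COLUMNS order."""
    for row in rows:
        renamed = {COLUMN_RENAMES.get(key, key): value for key, value in row.items()}
        # The 2022 header has no olink_id column, so missing fields stay empty.
        yield {col: renamed.get(col, "") for col in TARGET_COLUMNS}


# Made-up 2022-style record, for illustration only.
sample_2022 = io.StringIO(
    "uniprot_id,specimen_id,lower_limit_of_quantitation,protein_expression,"
    "quality_control,unit,upper_limit_of_quantitation\n"
    "P00000,1,0.1,2.5,Pass,NPX,10.0\n"
)
harmonized = list(harmonize_olink_rows(csv.DictReader(sample_2022)))
```

Rows from the 2020 and 2021 files pass through with the same keys, so all three years can share one writer.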

pramodsshinde commented 1 year ago

Hi @joreynajr

Thanks for noting down all the issues. I worked on them and modified the files. The modified files can be accessed here. Please let me know if you notice any additional issues.

pramodsshinde commented 1 year ago

> I also ran into another issue, for olink some proteins are given using two ids, here are some examples of that happening:
>
> Q29983,Q29980
> P29459,P29460
> Q14213,P29459

These are basically protein complexes, which is why two UniProt IDs are associated with them. We are working on creating ontology terms for them, but it will take some time (for the 3rd challenge).

pramodsshinde commented 1 year ago

> I found another tricky issue, when downloading the following files they are actually saved using tabs as delimiters but the extension says ".csv":
>
> 2020_abtiters.csv, 2020_live_cell_percentage.csv, 2020_olink.csv, 2020_rnaseq.csv
> 2021_abtiter.csv, 2021_live_cell_percentage.csv, 2021_olink.csv, 2021_rnaseq.csv
>
> I've manually changed this on my end but definitely needs to be fixed.

This problem is fixed now. I changed file extensions to tsv and added \t as the delimiter.

joreynajr commented 1 year ago

I still see the same thing on my end when I use this link: https://www.cmi-pb.org/downloads/cmipb_challenge_datasets/2nd_cmipb_challenge/10202022/

pramodsshinde commented 1 year ago

Sorry for the confusion. I created a new file dir here: https://www.cmi-pb.org/downloads/cmipb_challenge_datasets/2nd_cmipb_challenge/11282022/

joreynajr commented 1 year ago

Thanks! I'll start taking a look

CMI-PB / cmi-pb-multiomics-jive

Creating harmonized datasets for 2nd prediction challenge #4