eliteportal / data-models

data models for the elite project
https://eliteportal.github.io/data-models/
MIT License

Specify columns need to replace the corresponding column value #23

Open nlee-sage opened 6 months ago

nlee-sage commented 6 months ago

During curation, I found a metadata file that uses the "specifyPlatformVersion" column. The values in "specifyPlatformVersion" need to replace the values in "platformVersion", because the placeholder "OtherPlatformVersion" is not a helpful annotation and leaves contributors with two columns to search.

This issue applies to all "specify" columns across templates.
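
To illustrate what "all specify columns" could mean in practice, here is a minimal sketch that pairs each "specify" column with its base column from a list of column names. It assumes a specify column is always named "specify" followed by the base column name (compared case-insensitively). The helper name `find_specify_pairs` and every example column except "platformVersion" and "specifyPlatformVersion" are hypothetical.

```python
import re

# Hypothetical helper: map each base column to its "specify" column by name.
def find_specify_pairs(columns):
    """Return {base_column: specify_column} for names of the form 'specify<Base>'."""
    by_lower = {c.lower(): c for c in columns}
    pairs = {}
    for col in columns:
        match = re.match(r"specify(.+)", col, flags=re.IGNORECASE)
        if not match:
            continue  # not a specify column
        base = by_lower.get(match.group(1).lower())
        if base is not None and base != col:
            pairs[base] = col
    return pairs

# Example column names; only the platformVersion pair comes from the issue.
columns = ["specimenID", "platformVersion", "specifyPlatformVersion", "specifyAssay"]
print(find_specify_pairs(columns))
```

Specify columns with no matching base column ("specifyAssay" in this example) are left out of the result, so they can be reviewed by hand instead of being silently dropped.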

nlee-sage commented 6 months ago

Code used as a quick fix for the issue

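# Import the CSV file. In this case it is an assay metadata file.
# NOTE: "assay_metadata.csv" is a placeholder path, not the actual file from curation.
import pandas as pd
assay_metadata_df = pd.read_csv("assay_metadata.csv")

# Remove any completely empty columns. These should be the specify columns only, since they are optional.
assay_metadata_df = assay_metadata_df.dropna(axis = 1, how = 'all')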

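# Find the specify columns (case-insensitive match on the column name).
# The pattern is 'specify', not 'specify*': the trailing '*' made the 'y' optional,
# so names such as 'specificity' would also have matched.
import re
r = re.compile('specify', flags = re.IGNORECASE)
specify_cols = [c for c in assay_metadata_df.columns if r.search(str(c))]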

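# Remove 'specify' from the column names to get the normal column names.
# The substitution is case-insensitive so that it agrees with the search above.
normal_cols = [re.sub('specify', "", str(s), flags = re.IGNORECASE) for s in specify_cols]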

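# Create the link between columns as [normal column, specify column].
# Names are compared exactly (ignoring case), so the order of the pair does not depend
# on column order, and specify columns without a matching normal column are skipped.
pairings = []
for spec, c in zip(specify_cols, normal_cols):
    matches = [s for s in assay_metadata_df.columns if str(s).lower() == c.lower() and s != spec]
    if matches:
        pairings.append([matches[0], spec])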

print(pairings)

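# Test changes on a copy first before reassigning to the main dataframe
assay_metadata_df_temp = assay_metadata_df.copy(deep = True)
for p in pairings:
    print(p[0], p[1], sep = ' : ')
    # Use the specify value where one was given; otherwise keep the existing value.
    # (Assigning the specify column directly would blank out rows that have no specify value.)
    assay_metadata_df_temp[p[0]] = assay_metadata_df_temp[p[1]].fillna(assay_metadata_df_temp[p[0]])

    assay_metadata_df_temp = assay_metadata_df_temp.drop(columns = [p[1]])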

print(assay_metadata_df_temp.head())
print(assay_metadata_df_temp.shape)
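
Since the issue applies to every template, the quick fix could be wrapped in a reusable function. The following is only a sketch under stated assumptions: specify columns are named with a "specify" prefix followed by the base column name, and a specify value should win only where one is present. The function name `merge_specify_columns` and the example rows are hypothetical and not part of the data-models repo.

```python
import pandas as pd

def merge_specify_columns(df, prefix="specify"):
    """Fold each '<prefix><Name>' column into its matching '<name>' column.

    Hypothetical helper, not part of the data-models repo.
    """
    out = df.copy()
    by_lower = {str(c).lower(): c for c in out.columns}
    for col in list(out.columns):
        name = str(col)
        # Skip columns that are not '<prefix><something>'.
        if not name.lower().startswith(prefix.lower()) or len(name) == len(prefix):
            continue
        base = by_lower.get(name[len(prefix):].lower())
        if base is None or base == col:
            continue  # no matching base column: leave the specify column alone
        # Prefer the specify value where present, otherwise keep the base value.
        out[base] = out[col].fillna(out[base])
        out = out.drop(columns=[col])
    return out

# Made-up example rows, for illustration only.
example = pd.DataFrame({
    "specimenID": ["s1", "s2"],
    "platformVersion": ["OtherPlatformVersion", "v1"],
    "specifyPlatformVersion": ["v2.5", None],
})
merged = merge_specify_columns(example)
print(merged)
```

In this example the first row takes its value from "specifyPlatformVersion", while the second row keeps its original "platformVersion" value because no specify value was given. The function returns a new dataframe, so the result can be inspected before it replaces the original.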