PNNL-CompBio / coderdata

Automation scripts and benchmark dataset package for cancer drug prediction deep learning models.

broad_sanger samples file has quotes around all values - (breaking Polars during testing) #160

Closed jjacobson95 closed 4 months ago

jjacobson95 commented 7 months ago

Attached is a screenshot comparing the broad_sanger samples file to the cptac samples file. Quotes should be removed from all strings.

[Screenshot 2024-04-30 at 8 42 57 AM]

sgosline commented 7 months ago

ok, can you just look to see if there are any sample identifiers with commas in them?

jjacobson95 commented 7 months ago

Ahh I see, there are cancer_types, other_names, and common_names with commas. I'll close this and just build it into the package to check for this.
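
A minimal sketch of the kind of check described above, using only Python's standard `csv` module. The sample rows, the column names, and the helper `columns_with_commas` are made up for illustration and are not taken from the actual broad_sanger file or the coderdata package.

```python
import csv
import io

# Hypothetical excerpt of a samples file; values are illustrative only.
SAMPLE_CSV = '''"other_id","common_name","cancer_type"
"S1","A549","Lung Adenocarcinoma"
"S2","HT-29, clone 7","Colon Adenocarcinoma"
"S3","MCF7","Breast Carcinoma, NOS"
'''


def columns_with_commas(text):
    """Return the sorted names of columns where at least one value contains a comma."""
    reader = csv.DictReader(io.StringIO(text))
    flagged = set()
    for row in reader:
        for name, value in row.items():
            # Only inspect ordinary string cells.
            if isinstance(name, str) and isinstance(value, str) and "," in value:
                flagged.add(name)
    return sorted(flagged)


print(columns_with_commas(SAMPLE_CSV))
```

Any column flagged this way needs its quotes kept in the file, since stripping them would split the value across fields.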

jjacobson95 commented 7 months ago

Sorry, re-opening. The column values are fine, but I think we should write the headers without quotes. I think any inter-dataset operations such as merge, concat, etc. won't work if the headers are different.

sgosline commented 7 months ago

polars has a quote_char argument that should handle this.

sgosline commented 7 months ago

i'd suggest using this instead of changing the underlying schema, as quotes will be scattered throughout.