NaegleLab / CoDIAC


Improve reference file readability #7

Closed knaegle closed 1 year ago

knaegle commented 1 year ago

Is your feature request related to a problem? Please describe.

Right now the domain reference (CSV) file repeats the accession and gene name inside the boundaries field. This makes the file hard to read and increases the number of fields that need to be parsed later.

Describe the solution you'd like

I suggest we remove the accession and gene name from the boundaries field.


knaegle commented 1 year ago

@alekhyaa2 I fixed the repeated Accession:GeneName in the existing code. However, we also want to drop the brackets in this field and the explicit string quoting. I.e., we want the field to read like this:

SH3:4:80;Rho-GAP:109:295;SH2 1:330:425;SH2 2:622:716

Instead of this:

['SH3:4:80', 'Rho-GAP:109:295', 'SH2 1:330:425', 'SH2 2:622:716']

To fix this, we should move back to ;-separated entries so the field does not need to be quoted in the CSV (i.e., let's not use commas to separate things inside a single column).
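A minimal sketch of the conversion described above, assuming the current field is the string form of a Python list and that each entry is name:start:end. The function names here (format_domains, convert_legacy_field, parse_domains) are hypothetical and are not taken from the CoDIAC code base.

```python
import ast


def format_domains(domains):
    """Join 'name:start:end' entries with ';' so the field contains no commas."""
    return ";".join(domains)


def convert_legacy_field(field):
    """Convert a stringified Python list into the ';'-separated form.

    Assumes the legacy field looks like "['SH3:4:80', 'Rho-GAP:109:295']".
    """
    return format_domains(ast.literal_eval(field))


def parse_domains(field):
    """Split a ';'-separated field into (name, start, end) tuples."""
    parsed = []
    for entry in field.split(";"):
        # Split from the right so the domain name is kept whole.
        name, start, end = entry.rsplit(":", 2)
        parsed.append((name, int(start), int(end)))
    return parsed


legacy = "['SH3:4:80', 'Rho-GAP:109:295', 'SH2 1:330:425', 'SH2 2:622:716']"
compact = convert_legacy_field(legacy)
print(compact)
print(parse_domains(compact))
```

Because the field no longer contains commas, a CSV writer has no reason to quote it, and reading it back is a single split on ; followed by a split on the last two colons.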