MetaCell / asu-olfactory

MIT License

Fix delimiter issue that will pull a big block of data #28

Closed enicolasgomez closed 1 year ago

jrmartin commented 2 years ago

@enicolasgomez The problem here comes from the way we parse the CID file in the normalize script: https://github.com/MetaCell/asu-olfactory/blob/feature/20/applications/pub-chem-index/tasks/ingestion/normalize.py#L50

The CID file mostly has one CID/Synonym pair per row, with a single space as the delimiter. In some rows, though, the CID and Synonym are separated by a tab. The normalize script does not split these rows, because the delimiter we use on line 50 doesn't detect tabs.

CIDs where the delimiter issue happens: 76540, 22628
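
A minimal sketch of one way to handle both delimiters, assuming each row has the form `CID<space or tab>Synonym` and that the synonym itself may contain spaces. It does not reproduce the actual code on line 50; `split_cid_row` is a hypothetical helper name and the sample rows are placeholders, not real synonyms from the CID file.

```python
import re

# Matches a single space or a single tab (assumed to be the only two
# delimiters that occur between the CID and the synonym).
_DELIMITER = re.compile(r"[ \t]")


def split_cid_row(row):
    """Split one row into (cid, synonym) at the first space or tab.

    Returns None when the row contains no delimiter, so the caller can
    log or skip it instead of treating the whole row as one field.
    """
    # Drop the line ending, then split only once so that spaces inside
    # the synonym are preserved.
    parts = _DELIMITER.split(row.rstrip("\r\n"), maxsplit=1)
    if len(parts) != 2:
        return None
    cid, synonym = parts
    return cid, synonym


if __name__ == "__main__":
    # Placeholder rows: one space-delimited, one tab-delimited.
    for sample in ["123 example synonym\n", "76540\tanother example\n"]:
        print(split_cid_row(sample))
```

Splitting only once at the first delimiter keeps multi-word synonyms intact, and returning `None` for rows with no delimiter makes unsplit rows visible rather than letting them pass through as one large value.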