CHIELDOnline / CHIELD


Make number and type of columns consistent throughout the database #714

Open rien333 opened 3 years ago

rien333 commented 3 years ago

Thanks for all the work so far!

I've been playing around with the CSV files in Python, and some of my scripts produced weird results. It turns out they tripped over the fact that some CSV files have (generally empty) columns that are not present in all of the other files.

Take for example evolang11_41.csv:

"X","TODO","bibref","Var1","Relation","Var2","Cor","Process","Topic","Stage","Type","Subtype","Test","Stat","Confirmed","Languages","Species","Confidence","Notes","General.description"
105,NA,"evolang11_41","gene: SOX10",">","neural crest cells","pos","","Molecular genetics","coevolution","review","neuroscience; genetics",NA,NA,NA,NA,NA,NA,"",NA

Or atkinson2015speaker.csv:

"","Var1","Relation","Var2","Cor","Topic","Stage","Type","Confirmed","Notes","bibref"
"1","variation: phonology",">","learning phonemic boundaries","pos",NA,NA,"review","yes","A number of studies have demonstrated the effect that input variability can have on the acquisition of phonemic (or tonal [13] contrasts.","atkinson2015speaker"

Given that these columns are not used, nor discussed in the CHIELD paper, it seems best to delete them.

My problem stems from the fact that I want to process the CSV files as Python dictionaries, with keys corresponding to the column names, like so:

import csv

fieldnames = ["Var1", "Relation", "Var2", "Cor", "Topic", "Stage", "Type", "Confirmed", "Notes", "bibref"]

with open("file.csv", newline='') as csvfile:
    next(csvfile)  # skip the header line
    reader = csv.DictReader(csvfile, fieldnames=fieldnames)
    for row in reader:
        do_something(row["Var2"])  # placeholder for per-row processing

Specifying the fieldnames parameter of DictReader in this way fails, however, because it assumes that every file has the same columns in the same order.
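
To make the failure concrete, here is a minimal sketch that feeds the evolang11_41.csv header and row quoted above through the same fixed fieldnames list. It reads from an in-memory string rather than a file, which is an assumption made only to keep the example self-contained. DictReader assigns values by position, so the wrong cell ends up under "Var2" and the surplus cells are collected under the default restkey (None).

```python
import csv
import io

# Header and data row copied from evolang11_41.csv as quoted in this issue.
sample = (
    '"X","TODO","bibref","Var1","Relation","Var2","Cor","Process","Topic","Stage",'
    '"Type","Subtype","Test","Stat","Confirmed","Languages","Species","Confidence",'
    '"Notes","General.description"\n'
    '105,NA,"evolang11_41","gene: SOX10",">","neural crest cells","pos","",'
    '"Molecular genetics","coevolution","review","neuroscience; genetics",'
    'NA,NA,NA,NA,NA,NA,"",NA\n'
)

fieldnames = ["Var1", "Relation", "Var2", "Cor", "Topic", "Stage",
              "Type", "Confirmed", "Notes", "bibref"]

stream = io.StringIO(sample)
next(stream)  # skip the header line, as in the script above
row = next(csv.DictReader(stream, fieldnames=fieldnames))

print(row["Var1"])      # 105 (actually the "X" column)
print(row["Var2"])      # evolang11_41 (actually the "bibref" column)
print(len(row[None]))   # 10 surplus values stored under the restkey
```

No exception is raised, so the mismatch only shows up as wrong values further down the pipeline.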


EDIT: my way around this is to not make any assumptions about which columns are present in a file:

import csv

with open("file.csv", newline='') as csvfile:
    header = next(csvfile).strip()  # read the column headers
    fields = [f.strip('"') for f in header.split(',')]
    # use this file's own headers as the dictionary keys
    reader = csv.DictReader(csvfile, fieldnames=fields)
    for row in reader:
        ...
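
A possible simplification, sketched under some assumptions: when no fieldnames argument is given, csv.DictReader takes its keys from the first row of the file, so the header does not need to be split by hand. The helper names read_rows and column_values are hypothetical, and UTF-8 encoding is assumed rather than confirmed for the CHIELD files.

```python
import csv

def read_rows(path):
    """Yield one dict per data row, keyed by the file's own header line."""
    # Assumption: the files are UTF-8 encoded.
    with open(path, newline="", encoding="utf-8") as csvfile:
        # With no fieldnames argument, DictReader uses the first row as keys.
        yield from csv.DictReader(csvfile)

def column_values(path, column, default=None):
    """Collect one column; rows from files that lack it yield `default`."""
    return [row.get(column, default) for row in read_rows(path)]
```

Using row.get instead of row[...] means that optional columns such as "Process" can be requested from any file without raising a KeyError.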

rien333 commented 3 years ago

Another inconsistency is that the fields in some files are not wrapped in quotes, while in most files they are.

Take, for instance, dunbar2004gossip.csv:

Var1,Relation,Var2,Cor,Topic,Stage,Type,Confirmed,Notes,bibref

Again, this makes it somewhat more difficult/annoying to process the raw data.
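
For what it's worth, the quoting difference mostly bites when the header is split by hand. A small sketch, assuming the header is parsed with Python's csv module instead of str.split: the unquoted line is the dunbar2004gossip.csv header from above, and the quoted line is the same header with quotes added for comparison. parse_header is a hypothetical helper name.

```python
import csv
import io

unquoted = 'Var1,Relation,Var2,Cor,Topic,Stage,Type,Confirmed,Notes,bibref\n'
quoted = ('"Var1","Relation","Var2","Cor","Topic","Stage","Type",'
          '"Confirmed","Notes","bibref"\n')

def parse_header(line):
    """Parse one header line with the csv module instead of str.split."""
    return next(csv.reader(io.StringIO(line)))

print(parse_header(quoted) == parse_header(unquoted))  # True
print(parse_header(unquoted)[2])                       # Var2
```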

seannyD commented 3 years ago

Thanks for this note, you're correct. I'm using R's default csv reader, which is more tolerant of this inconsistency. But I can run a script to normalise everything.
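
A normalising pass could look roughly like the following Python sketch. This is not the maintainer's script (which would presumably be written in R). The target column list is an assumption taken from the ten column names used in the Python script earlier in this thread, and normalise_csv is a hypothetical function name.

```python
import csv

# Assumed target schema: the ten columns used earlier in this thread.
CANONICAL = ["Var1", "Relation", "Var2", "Cor", "Topic", "Stage",
             "Type", "Confirmed", "Notes", "bibref"]

def normalise_csv(src, dst, columns=CANONICAL):
    """Rewrite src to dst with exactly `columns`, quoting every field."""
    # Assumption: the files are UTF-8 encoded.
    with open(src, newline="", encoding="utf-8") as fin, \
         open(dst, "w", newline="", encoding="utf-8") as fout:
        reader = csv.DictReader(fin)  # keys come from the file's own header
        writer = csv.DictWriter(
            fout,
            fieldnames=columns,
            restval="",             # fill columns the source file lacks
            extrasaction="ignore",  # drop columns outside the schema
            quoting=csv.QUOTE_ALL,  # wrap every field in quotes
        )
        writer.writeheader()
        for row in reader:
            writer.writerow(row)
```

One caveat on the design: dropping everything outside the schema discards data where an extra column is not empty. In the evolang11_41.csv row above, "Subtype" holds "neuroscience; genetics", so writing the union of all columns may be the safer choice.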

seannyD commented 3 years ago

By the way, combined versions of the data are available as single CSV files at https://github.com/CHIELDOnline/CHIELD/tree/master/data/db and are updated after every rebuild.

See https://chield.excd.org/downloads.html