arw36 / willoughby-etal-2017-virus-interactions

Configuration to integrate bat virus-host interactions from Willoughby et al. for GloBI

suspicious column names and alignment #3

Closed jhpoelen closed 3 years ago

jhpoelen commented 3 years ago

hey @arw36 - I was just poking around GloBI and I noticed your new data additions . . . nice!

With some minor tweaks to your table, GloBI will be able to index your interactions:

  1. rename SourceTaxonName -> sourceTaxonName (case sensitive)
  2. put taxon names in a targetTaxonName column instead of the targetOccurrenceId column
  3. verify that the file is saved with UTF-8 character encoding, to avoid funny characters in references (like SetiŽn) (e.g., "Aguilar-SetiŽn,A.;Romero-Almaraz,M.L.;Sanchez-Hernandez,C.;Figueroa,R.; Juarez-Palma, L.P.; Garcia-Flores, M.M.; Vazquez-Salinas, C.; Salas-Rojas, M.; Hidalgo-Martinez, A.C.; PierlŽ, S.A. Dengue virus in Mexican bats. Epidemiol. Infect. 2008, 136, 1678-1")
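
For anyone who prefers to script suggestions 1 and 2 rather than edit the header by hand, here is a minimal sketch in Python. The function name `rename_header` and the `RENAMES` mapping are made up for illustration, and it assumes the targetOccurrenceId column holds only taxon names, so that renaming the column is the right fix.

```python
# Hypothetical mapping based on suggestions 1 and 2 above.
RENAMES = {
    "SourceTaxonName": "sourceTaxonName",
    "targetOccurrenceId": "targetTaxonName",
}


def rename_header(tsv_text: str) -> str:
    """Rename columns in the first (header) line; data rows are left as-is."""
    header, sep, rest = tsv_text.partition("\n")
    # Keep a Windows-style carriage return out of the last column name.
    cr = "\r" if header.endswith("\r") else ""
    names = [RENAMES.get(name, name) for name in header.rstrip("\r").split("\t")]
    return "\t".join(names) + cr + sep + rest
```

Column names are matched exactly, so a header that is already correct passes through unchanged.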

Hope this helps.

arw36 commented 3 years ago

Fixed 1. and 2. For 3. I don't know how to verify UTF-8. I am able to save as UTF-16 txt then change to tsv.
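
One possible way to verify UTF-8, and to re-encode a UTF-16 export, is a short Python script. This is only a sketch: the helper names `is_valid_utf8` and `utf16_to_utf8` are invented here, and it assumes the UTF-16 file starts with a byte order mark, which lets Python's `utf-16` codec pick the byte order.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if the bytes decode cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def utf16_to_utf8(data: bytes) -> bytes:
    """Re-encode UTF-16 bytes as UTF-8.

    The "utf-16" codec reads the byte order mark to choose the byte order
    and drops it from the decoded text.
    """
    return data.decode("utf-16").encode("utf-8")
```

To use it on a file, read the bytes with `open(path, "rb").read()`, convert, and write the result back with `open(path, "wb")`. Passing the check only shows the bytes are well-formed UTF-8; characters that were already garbled before saving will still look garbled.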

jhpoelen commented 3 years ago

Wow, that was quick! About UTF-8 - as far as I know, many projects use UTF-8, but I haven't seen UTF-16 around much. Any particular reason you are using UTF-16 instead of UTF-8?

arw36 commented 3 years ago

I use MS Excel (v 16.46) for data cleaning and organization. When saving as txt, it offers only a no-encoding option or a UTF-16 option; it does not offer UTF-8. I then manually change the extension to tsv, as there is no "save as tsv file" option.

jhpoelen commented 3 years ago

Thanks for sharing the tools you use for data munching.

Would it help if GloBI automatically checked for, and indexed, interactions.txt in addition to checking for interactions.tsv (and interactions.csv)?

As far as the UTF-8 thing goes, it appears that the options are tucked away a bit (according to some pages I found). See, e.g., https://social.technet.microsoft.com/Forums/en-US/b95d3770-01b7-421c-9eb1-6f0b38ce5b5c/saving-an-excel-document-as-tab-delimited-utf8-text-file

Needless to say, if you'd rather use comma-separated values, that is supported also. I prefer IANA's tab-separated format http://www.iana.org/assignments/media-types/text/tab-separated-values because of the ease of parsing . . . but hey, I understand that different tools like different things.

curious to hear your thoughts on this.

arw36 commented 3 years ago

Oh, I didn't realize it already auto checks for csv files. I'll likely switch to this as MS Excel has UTF-8 encoding for csv.

jhpoelen commented 3 years ago

@arw36 sounds good! It should be sufficient to rename interactions.tsv --> interactions.csv and replace with the appropriate comma-separated UTF-8 content.
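
If exporting from Excel again is inconvenient, the tab-to-comma conversion could also be scripted. Below is a rough Python sketch using the standard `csv` module; the function name `tsv_to_csv` is made up, and it assumes the TSV is plain IANA-style text with no quoting, so quote characters in references are treated as ordinary characters.

```python
import csv
import io


def tsv_to_csv(tsv_text: str) -> str:
    """Convert tab-separated text to comma-separated text."""
    # QUOTE_NONE: treat double quotes in the TSV as literal characters (assumption).
    reader = csv.reader(
        io.StringIO(tsv_text, newline=""), delimiter="\t", quoting=csv.QUOTE_NONE
    )
    out = io.StringIO(newline="")
    # The writer quotes any field containing commas, quotes, or newlines.
    writer = csv.writer(out, lineterminator="\n")
    writer.writerows(reader)
    return out.getvalue()
```

Writing the result with `open("interactions.csv", "w", encoding="utf-8", newline="")` should give a UTF-8 file; fields such as references that contain commas come out quoted.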

jhpoelen commented 3 years ago

Thanks for responding to my questions!