csarven / doingbusiness-linked-data

Doing Business Linked Data

CSV delimiters and file extension #3

Open csarven opened 9 years ago

csarven commented 9 years ago

  • Use comma (,) as a delimiter for the values.
  • Optionally, use double-quotes (") for the text delimiter.
  • Use the well-known filename extension .csv (instead of .txt).

reni99 commented 9 years ago

Alright, I'll change it.

reni99 commented 9 years ago

One question here: the reason I used the semicolon (;) as the delimiter is that there are commas in some economy labels. If I change back to commas, how would you treat the ones in the economy labels? Would you mask them with some unique string and then change them back afterwards, like sed "s/, /XXX/g"?

csarven commented 9 years ago

That's exactly what the quotes are for. E.g., "foo, bar","baz" is a row with two columns, where "foo, bar" is the first value and "baz" is the second. The comma within the quotes is a literal comma, and the comma outside the quotes is the delimiter.

The literal comma can also be escaped with a backslash, e.g., foo\, bar, baz, instead of using quotes.

There are different rules/ways to do this. Look it up (see also https://tools.ietf.org/html/rfc4180 ). Stick to one style throughout the scripts - will make your life easier.

Note: Leave the numerics alone i.e., don't quote them. And, use period . for decimal values.
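
For illustration, here is a minimal Python sketch of the quoting style described above, using the standard csv module. The column names and numeric values are made up for the example and are not taken from the actual dataset. With QUOTE_MINIMAL, only fields that contain the delimiter (or a quote or newline) get quoted, so numerics stay unquoted and keep the period as the decimal mark.

```python
import csv
import io

# Hypothetical rows: the header names and scores are illustrative only.
rows = [
    ["economy", "year", "score"],
    ["Taiwan, China", "DB2004", 12.5],  # label contains a literal comma
    ["Romania", "DB2004", 7],
]


def to_csv(rows):
    """Serialize rows with comma delimiters, quoting only where needed."""
    buf = io.StringIO()
    writer = csv.writer(buf, quoting=csv.QUOTE_MINIMAL, lineterminator="\n")
    writer.writerows(rows)
    return buf.getvalue()


def from_csv(text):
    """Parse CSV text back into a list of rows (all values come back as strings)."""
    return list(csv.reader(io.StringIO(text)))


text = to_csv(rows)
print(text)
parsed = from_csv(text)
print(parsed[1][0])  # the label round-trips with its comma intact
```

Any CSV-aware reader should split the second line into three fields, not four, which is the point of quoting over masking.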

reni99 commented 9 years ago

I tried to get it done with quotes and also tried \,. With both I ran into problems with sorting/merging. I tried everything, but in the end I had to go back to my initial idea of masking. It would be nice to have it a bit cleaner, but at least with the current solution the merging works...

But there is one more thing: there are economies which don't have the same 3-letter code as the World Bank codes. The ones here:

Romania,DB2004,,,,,,,,,,,,
TaiwanXXXChina,DB2004,,,,,,,,,,,,
Timor-Leste,DB2004,,,,,,,,,,,,
West Bank and Gaza,DB2004,,,,,,,,,,,,

How should I treat these?

csarven commented 9 years ago

  1. Ahh.. please find appropriate tooling or get it working on the command line. What you are experiencing is certainly nothing new. Does a value like TaiwanXXXChina look okay to you? The problem is that it introduces semantics into the data which can only be reliably interpreted by you and the software you are using. If someone else were to pick it up and try to use their own software, they might have to do more hacking than necessary, instead of simply dealing with "Taiwan, China" or something.
  2. Have you tested the CSV output in tarql? If yes and there are no problems, you can put 1 from above aside for now, but please get back to it later :)
  3. Ok, small change. Use https://raw.githubusercontent.com/datasets/country-codes/master/data/country-codes.csv (get a copy, and keep track of the URL for provenance) instead of the WB SPARQL endpoint. After the join, use PS for the 2-letter code for "West Bank and Gaza". In parallel, contact DB and ask whether they consider or acknowledge West Bank and Gaza as the State of Palestine. See also http://en.wikipedia.org/wiki/ISO_3166-2:PS . Update: Note also that World Bank Group Finances identifies "West Bank and Gaza" https://finances.worldbank.org/countries/West%20Bank%20and%20Gaza and uses the code GZ. This is not an ISO 3166 code. I suspect that DB will acknowledge GZ as the 2-letter code instead of PS.
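
A rough sketch of the join in point 3, in Python. The inline COUNTRY_CODES text is only a stand-in for a local copy of country-codes.csv, and its column names (name, alpha2, alpha3) are assumptions for the example; check the real file's header before using this. The only override applied is the one mentioned above ("West Bank and Gaza" to PS). Anything that still has no match is reported rather than guessed.

```python
import csv
import io

# Stand-in for a local copy of country-codes.csv.
# Column names here are assumptions, not the real file's header.
COUNTRY_CODES = """name,alpha2,alpha3
Romania,RO,ROU
Timor-Leste,TL,TLS
"""

# Manual override taken from the discussion above.
OVERRIDES = {"West Bank and Gaza": "PS"}


def build_lookup(codes_csv):
    """Map each country name to its 2-letter code."""
    reader = csv.DictReader(io.StringIO(codes_csv))
    return {row["name"]: row["alpha2"] for row in reader}


def assign_codes(economies, lookup, overrides):
    """Return (matched, unmatched): name-to-code pairs and leftover names."""
    matched, unmatched = {}, []
    for name in economies:
        code = overrides.get(name) or lookup.get(name)
        if code:
            matched[name] = code
        else:
            unmatched.append(name)
    return matched, unmatched


lookup = build_lookup(COUNTRY_CODES)
matched, unmatched = assign_codes(
    ["Romania", "Taiwan, China", "Timor-Leste", "West Bank and Gaza"],
    lookup,
    OVERRIDES,
)
print(matched)
print(unmatched)  # names that still need a decision
```

Keeping the unmatched names in a separate list makes the remaining cases visible instead of silently dropping them from the join.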

csarven commented 9 years ago

CSV tools to consider:

See if you can do something at earlier steps. Introducing a different delimiter at ssconvert for instance is not a good idea. I really think this is making it more complicated than it needs to be.
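
As one possible way to avoid the masking step: if the sorting/merging problems came from line-oriented tools that split on every comma, doing the merge with a CSV-aware parser sidesteps them. Below is a minimal Python sketch under that assumption. The file contents, the rank column, and the merge_on helper are hypothetical and only there to show the idea.

```python
import csv
import io


def read_rows(text):
    """Parse CSV text into a list of dicts keyed by the header row."""
    return list(csv.DictReader(io.StringIO(text)))


def merge_on(left, right, key):
    """Inner-join two lists of dict rows on `key`, sorted by that key."""
    right_by_key = {row[key]: row for row in right}
    merged = [
        {**row, **right_by_key[row[key]]}
        for row in left
        if row[key] in right_by_key
    ]
    return sorted(merged, key=lambda row: row[key])


# Hypothetical inputs; the quoted label contains a literal comma.
left = read_rows('economy,year\n"Taiwan, China",DB2004\nRomania,DB2004\n')
right = read_rows('economy,rank\nRomania,2\n"Taiwan, China",1\n')

merged = merge_on(left, right, "economy")
for row in merged:
    print(row)
```

Because the parser handles the quotes, "Taiwan, China" stays a single key throughout, and no placeholder like XXX ever enters the data.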

reni99 commented 9 years ago

> Ahh.. please find appropriate tooling or get it working on the command line. What you are experiencing is certainly nothing new. Does a value like TaiwanXXXChina look okay to you? The problem is that it introduces semantics into the data which can only be reliably interpreted by you and the software you are using. If someone else were to pick it up and try to use their own software, they might have to do more hacking than necessary, instead of simply dealing with "Taiwan, China" or something.
>
> Have you tested the CSV output in tarql? If yes and there are no problems, you can put 1 from above aside for now, but please get back to it later :)

It is definitely something that bothers me too. I am still investigating now and then, but I will continue further for now. I am doing the mapping with tarql now and will update TI.

How about ILO?

> Ok, small change. Use https://raw.githubusercontent.com/datasets/country-codes/master/data/country-codes.csv (get a copy, and keep track of the URL for provenance) instead of the WB SPARQL endpoint. After the join, use PS for the 2-letter code for "West Bank and Gaza". In parallel, contact DB and ask whether they consider or acknowledge West Bank and Gaza as the State of Palestine. See also http://en.wikipedia.org/wiki/ISO_3166-2:PS

Okay, I will do it this way.

> CSV tools to consider:
>
> See if you can do something at earlier steps. Introducing a different delimiter at ssconvert for instance is not a good idea. I really think this is making it more complicated than it needs to be.

Thanks for the hints!
