csarven / doingbusiness-linked-data

Doing Business Linked Data

revisit: whether to use tarql or Bash for data mapping? #6

Open reni99 opened 9 years ago

reni99 commented 9 years ago

Hey Sarven, I have a question concerning the use of tarql to create the DSD file. You suggested making as much use of tarql as possible for this.

I think this adds unnecessary complexity to the code. You have to access the CSV files anyway, and if you do it with Bash only, you are more flexible IMO.

Simple example:

for ((i=0; i<${arrayLength}; i++));
do
    # This would stay anyway, with or without tarql
    echo "#" >> meta.ttl;
    echo "# ${codeIndicatorLabels[$i]} dataset" >> meta.ttl;
    echo "#" >> meta.ttl;
    printf "\n" >> meta.ttl;

    # The dataset names (like starting-a-business etc.) are not accessible anywhere in the CSV files.
    # This means you have to add a loop to create a query for every indicator instead of the code below.
    echo "dataset:${codeIndicators[$i]}
    a qb:DataSet ;
    qb:structure structure:${codeIndicators[$i]} ;
    dcterms:title \"${codeIndicatorLabels[$i]}\"@en ;" >> meta.ttl;
...
done

This adds unnecessary files and the code won't get any shorter. Even worse: what do you do when it comes to the components? You don't want to repeat components that are already in the file. So again, pure Bash offers more flexibility here than tarql.

For the code:economy triples I see some use for it, but even there it seems that tarql is line-based and gives me the following result:

code:economy  rdf:type      sdmx:CodeList ;
    rdf:type            skos:ConceptScheme ;
    skos:hasTopConcept  economy:AF ;
    rdf:type            sdmx:CodeList ;
    rdf:type            skos:ConceptScheme ;
    skos:hasTopConcept  economy:AL ;
    rdf:type            sdmx:CodeList ;
    rdf:type            skos:ConceptScheme ;
    skos:hasTopConcept  economy:DZ ;
    rdf:type            sdmx:CodeList ;

...and so on, which is certainly not desirable.

Do you really think it is a good idea to use tarql for this task? I think it results in:

  1. More code
  2. Unnecessary dependencies
  3. More complexity

My suggestion is to stick to pure Bash and optimize this code.

I hope you understand what I am trying to say (it is a bit hard to explain).

csarven commented 9 years ago

Yes, I still think it is a good idea to get as much out of tarql as possible. Bash acts as "glue"; it shouldn't be the primary way to do data mapping. People invest a lot of time building mapping tools for good reasons. It is not necessarily desirable to do imperative mapping when it can be done declaratively.

As for your coding questions: you should revisit how you produce the triples for each line. For example, you don't need to produce the types for code:economy on every line. Instead, refactor your query so that the recurring triples are simply stated once in the CONSTRUCT.
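
For illustration, the code list query could look roughly like this. It is only a sketch: the ?Code and ?Economy variables are assumed CSV column headers, not necessarily the actual ones.

PREFIX skos:    <http://www.w3.org/2004/02/skos/core#>
PREFIX sdmx:    <http://purl.org/linked-data/sdmx#>
PREFIX code:    <http://doingbusiness.270a.info/code/>
PREFIX economy: <http://doingbusiness.270a.info/code/economy/>

CONSTRUCT {
    # The two type triples for code:economy are the same for every row;
    # once the output is deduplicated (or loaded into a store) they appear
    # only once.
    code:economy a sdmx:CodeList , skos:ConceptScheme ;
        skos:hasTopConcept ?economyURI .

    # Row-dependent part: one concept per economy in the CSV.
    ?economyURI a skos:Concept ;
        skos:inScheme code:economy ;
        skos:notation ?Code ;
        skos:prefLabel ?Economy .
}
WHERE {
    # ?Code and ?Economy are assumed column names in the CSV
    BIND (URI(CONCAT(STR(economy:), ?Code)) AS ?economyURI)
}

The CSV is passed to tarql on the command line (something like tarql economies.rq starting-a-business.csv; the file names here are just placeholders), so the same query can be reused for any file that has those columns.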

csarven commented 9 years ago

Alternatively, you can make multiple passes over the same CSV file to generate different output. Given the data size, this is a very low-cost way of doing and maintaining it. Trust me, you don't want to be in a situation where you have to do hard debugging of your own code. Fixing some "visible" code declaratively in tarql is a lot easier than messing around in Bash.
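
For example, the same CSV could go through a second, separate query that only produces the observations. Again, just a sketch: the ?Code and ?Rank column names, and the dimension/measure properties used here, are assumptions.

PREFIX qb:        <http://purl.org/linked-data/cube#>
PREFIX xsd:       <http://www.w3.org/2001/XMLSchema#>
PREFIX dataset:   <http://doingbusiness.270a.info/dataset/>
PREFIX economy:   <http://doingbusiness.270a.info/code/economy/>
PREFIX dimension: <http://doingbusiness.270a.info/dimension/>
PREFIX measure:   <http://doingbusiness.270a.info/measure/>

CONSTRUCT {
    # One qb:Observation per CSV row; the code list and the structure come
    # out of other passes (or static files).
    ?observation a qb:Observation ;
        qb:dataSet dataset:ease-of-doing-business ;
        dimension:economy ?economyURI ;
        measure:rank ?rankValue .
}
WHERE {
    # ?Code and ?Rank are assumed column names in the CSV
    BIND (URI(CONCAT(STR(economy:), ?Code)) AS ?economyURI)
    BIND (xsd:integer(?Rank) AS ?rankValue)
    BIND (URI(CONCAT(STR(dataset:), "ease-of-doing-business/", ?Code)) AS ?observation)
}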

reni99 commented 9 years ago

I totally agree with you that we should reuse existing, well-tested tools whenever possible. And I also agree that I don't want to be in a situation where I have to do hard debugging of my own code. But this task is not one for tarql, since the tool is made for mapping and not for the creation of DSD files IMO.

Let's look at some output tarql would have to create:

#
# Ease of Doing Business dataset
#

dataset:ease-of-doing-business
    a qb:DataSet ;
    qb:structure structure:ease-of-doing-business ;
    dcterms:title "Ease of Doing Business"@en ;
    dcterms:issued "2014-12-04T00:00:00Z"^^xsd:dateTime ;
    dcterms:modified "2014-12-04T00:00:00Z"^^xsd:dateTime ;
    dcterms:creator
        <http://renatostauffer.ch/#i> ,
        <http://csarven.ca/#i> ;
    dcterms:license <http://creativecommons.org/publicdomain/zero/1.0/> ;
    .

For the above output, tarql cannot take anything out of the CSV files. Because of that it is not a mapping task IMO. If we continue with the example:

structure:ease-of-doing-business
    a qb:DataStructureDefinition ;
    qb:component component-measure:rank ;
    qb:component component-dimension:economy ;
    qb:component component-dimension:refPeriod ;
    qb:component component-dimension-indicator:ease-of-doing-business ;
    qb:component component:overall-dtf ;
.

This is probably the best place to use tarql. But again, the DataStructureDefinition of each indicator is different. This means writing some kind of SPARQL query like the following for each indicator:

PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> 
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#> 
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#> 
PREFIX skos: <http://www.w3.org/2004/02/skos/core#> 
PREFIX dcterms: <http://purl.org/dc/terms/> 
PREFIX foaf: <http://xmlns.com/foaf/0.1/> 
PREFIX qb: <http://purl.org/linked-data/cube#> 
PREFIX sdmx: <http://purl.org/linked-data/sdmx#> 
PREFIX sdmx-attribute: <http://purl.org/linked-data/sdmx/2009/attribute#> 
PREFIX sdmx-code: <http://purl.org/linked-data/sdmx/2009/code#> 
PREFIX sdmx-concept: <http://purl.org/linked-data/sdmx/2009/concept#> 
PREFIX sdmx-dimension: <http://purl.org/linked-data/sdmx/2009/dimension#> 
PREFIX sdmx-measure: <http://purl.org/linked-data/sdmx/2009/measure#> 
PREFIX sdmx-metadata: <http://purl.org/linked-data/sdmx/2009/metadata#> 
PREFIX doingbusiness: <http://doingbusiness.270a.info/> 
PREFIX measure: <http://doingbusiness.270a.info/measure/> 
PREFIX doingbusiness-dataset: <http://doingbusiness.270a.info/dataset/> 
PREFIX doingbusiness-structure: <http://doingbusiness.270a.info/structure/> 
PREFIX dataset: <http://doingbusiness.270a.info/dataset/>
PREFIX structure: <http://doingbusiness.270a.info/structure/> 
PREFIX component: <http://doingbusiness.270a.info/component/> 
PREFIX dimension: <http://doingbusiness.270a.info/dimension/> 
PREFIX concept: <http://doingbusiness.270a.info/concept/> 
PREFIX concept-indicator: <http://doingbusiness.270a.info/concept/indicator/> 
PREFIX code: <http://doingbusiness.270a.info/code/> 
PREFIX code-indicator: <http://doingbusiness.270a.info/code/indicator/> 
PREFIX economy: <http://doingbusiness.270a.info/code/economy/>
PREFIX component-dimension: <http://doingbusiness.270a.info/component/dimension/>
PREFIX component-measure: <http://doingbusiness.270a.info/component/measure/> 
PREFIX component-attribute: <http://doingbusiness.270a.info/component/attribute/>
PREFIX component-dimension-indicator: <http://doingbusiness.270a.info/component/dimension/indicator/>

CONSTRUCT{
structure:ease-of-doing-business # The name ease-of-doing-business is not in the CSV
    a qb:DataStructureDefinition ;
    qb:component component-measure:rank ;
    qb:component component-dimension:economy ;
    qb:component component-dimension:refPeriod ;
    qb:component component-dimension-indicator:ease-of-doing-business ; # This is not in the CSV either
    qb:component component:overall-dtf .
}
FROM <../data/starting-a-business.2004.transformable.csv>
WHERE{

}
LIMIT 1

So would you suggest writing queries like the above for all the indicators?

The next few blocks of the DSD file are in the same category as the first: there is nothing in the CSV that can create these blocks (meaning no mapping):

code-indicator:ease-of-doing-business
    a skos:Concept ;
    skos:inScheme code:indicator ;
    skos:topConceptOf code:indicator ;
    skos:prefLabel "Ease of Doing Business"@en ;
    . 

code:indicator
a skos:ConceptScheme ;
skos:prefLabel "Indicator"@en ;
skos:hasTopConcept
    code-indicator:ease-of-doing-business,
    code-indicator:dealing-with-construction-permits,
    code-indicator:enforcing-contracts,
    code-indicator:getting-credit,
    code-indicator:getting-electricity,
    code-indicator:paying-taxes,
    code-indicator:protecting-minority-investors,
    code-indicator:registering-property,
    code-indicator:resolving-insolvency,
    code-indicator:starting-a-business,
    code-indicator:trading-across-borders ;
.

component:indicator-ease-of-doing-business
    a qb:ComponentSpecification ;
    qb:dimension dimension-indicator:ease-of-doing-business ;
    .
component:indicator-dealing-with-construction-permits
    a qb:ComponentSpecification ;
    qb:dimension dimension-indicator:dealing-with-construction-permits ;
.
component:indicator-enforcing-contracts
    a qb:ComponentSpecification ;
    qb:dimension dimension-indicator:enforcing-contracts ;
    .
component:indicator-getting-credit
    a qb:ComponentSpecification ;
    qb:dimension dimension-indicator:getting-credit ;
    .
component:indicator-getting-electricity
    a qb:ComponentSpecification ;
    qb:dimension dimension-indicator:getting-electricity ;
    .
component:indicator-paying-taxes
    a qb:ComponentSpecification ;
    qb:dimension dimension-indicator:paying-taxes ;
    .
component:indicator-protecting-minority-investors
    a qb:ComponentSpecification ;
    qb:dimension dimension-indicator:protecting-minority-investors ;
    .
component:indicator-registering-property
    a qb:ComponentSpecification ;
    qb:dimension dimension-indicator:registering-property ;
    .
component:indicator-resolving-insolvency
    a qb:ComponentSpecification ;
    qb:dimension dimension-indicator:resolving-insolvency ;
    .
component:indicator-starting-a-business
    a qb:ComponentSpecification ;
    qb:dimension dimension-indicator:starting-a-business ;
    .
component:indicator-trading-across-borders
    a qb:ComponentSpecification ;
    qb:dimension dimension-indicator:trading-across-borders ;
    .

Now, as I said in the previous comment, the country codes are pretty much the only place where I see a useful application of tarql. This is actually mapping: data from the CSV file gets mapped:

code:economy
    a sdmx:CodeList , skos:ConceptScheme ;
    skos:hasTopConcept
        economy:AF ,
        economy:AL ,
        economy:DZ ,
        economy:AO ,
        economy:AG ,
        economy:AR ,
        economy:AM ,
        economy:AU ,
        economy:AT ,
        ...

I will try to accomplish that with tarql.

For the components there is the following problem: there are indicators that have the same components. If I did the mapping with tarql, the final DSD would end up containing multiple identical triples (which is probably unwanted). For example, the component rank is in every CSV file, so you would end up with something like 11 rank components in the DSD when doing the tarql mapping.

The following Bash code handles this problem by sorting, in just one line (the first one):

sortedUniqueIndicators+=($(echo "${indicatorsToLoop[@]}" | tr ' ' '\n' | sort -u | tr '\n' ' '));
sortedUniqueIndicatorsLength=${#sortedUniqueIndicators[@]};

for ((i=0; i<${sortedUniqueIndicatorsLength}; i++));
do
    echo "component:${sortedUniqueIndicators[$i]}
    a qb:ComponentSpecification ;
    qb:measure measure:${sortedUniqueIndicators[$i]} ;
    ." >> meta.ttl;
    printf "\n" >> meta.ttl;

    echo "measure:${sortedUniqueIndicators[$i]}
    a qb:MeasureProperty
    ." >> meta.ttl;
    printf "\n" >> meta.ttl;
done

This eventually outputs every component just once. I don't know how I should accomplish that with just tarql...

component:rank
    a qb:ComponentSpecification ;
    qb:measure measure:rank ;
    .
measure:rank
    a qb:MeasureProperty ;
    .
component:dtf
    a qb:ComponentSpecification ;
    qb:measure measure:dtf ;
...

I hope this makes it somewhat clear why I still tend to use Bash code in these situations :D

csarven commented 9 years ago

I didn't mean that everything must be done within tarql. Some things can be done from Bash, e.g., adding a triple with a timestamp. At the same time, some triples can simply live in static files. Basically you have to balance out where it makes sense to do things declaratively (including tarql and static files) versus imperatively.

In response to the code snippets:

  1. Move this code to a static Turtle file. The only triples that Bash should create are execution-related things, like the timestamp. If you really can't use a tool to get the data out, then you have to enter it manually, which can go in a static file. This also helps later on, to re-use that file to get hold of something in particular.
  2. Yes, write a unique tarql query for each dataset structure (see the sketch after this list). The cost of writing that and being able to debug it is very cheap! And please, no terms like "transformable" in the filename!
  3. I don't have the data in front of me right now to investigate, but from memory I'd say it should be possible. What's the exact problem? Looking at a huge block of code doesn't help me identify what your issue is exactly. If it is about the concept or dimension URIs, why can't you get those? I presume the column name is available to you, and that you can normalize it?
  4. Don't use Bash for this. You'll have to show me the tarql/SPARQL for it to see if it can be improved. Look, it is not much of a problem to have some redundant triples as the worst case here. Keep in mind that, if you are going to generate the components, you essentially do a single pass. You don't have to have a single tarql query that does both the data and the metadata; break it apart. If you need to break it apart even further, do that as well. The beauty of consistent identifier use is that, when all of those triples go into a graph store, the redundancy is gone. Even without a store, that redundancy is really a non-issue (the same 11 triples? C'mon, it costs more to think about how to deal with that "issue" than to simply work with it), because you can convert to N-Triples, cat everything into a single file, then sort and keep uniques. Done!
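
Regarding 2, such a per-structure query could look roughly like the sketch below. It follows the pattern of your query above (same prefixes), and the component IRIs are simply hard-coded because they are not in the CSV.

CONSTRUCT {
    # Everything here is fixed for this dataset, so it is written out once
    # in the template; LIMIT 1 keeps it from being emitted once per CSV row.
    structure:ease-of-doing-business
        a qb:DataStructureDefinition ;
        qb:component component-measure:rank ,
            component-dimension:economy ,
            component-dimension:refPeriod ,
            component-dimension-indicator:ease-of-doing-business ,
            component:overall-dtf .
}
WHERE {
}
LIMIT 1

Since nothing in it comes from the CSV, it could arguably also live in a static Turtle file as in 1; keeping it as a tarql query just keeps all of the structure definitions in the same toolchain.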

To summarize: create tarql scripts that are simple enough to work with. Don't worry about how many there are. Don't worry about minor redundancies. Use Bash to "manage" the whole thing, so that the different processes can work together and the data can move around easily. That's the point. There are times when you can let a Bash script keep some of the data, but that's just a matter of whether that particular data (which is stored in the script) will be used again by something else or not. If it will be, move it to a static file.

Let me know if this is clear or sufficient. Otherwise, show me in person next week.

reni99 commented 9 years ago

OK, I will look for a way to do that.

I have another question: is it okay to have something like the following, or is this not really wanted?

#
# Ease of Doing Business dataset
#

dataset:ease-of-doing-business
    a qb:DataSet ;
    qb:structure structure:ease-of-doing-business ;
    dcterms:title "Ease of Doing Business"@en ;
    dcterms:license <http://creativecommons.org/publicdomain/zero/1.0/> ;

    .

dataset:ease-of-doing-business
    dcterms:issued "2014-12-04T00:00:00Z"^^xsd:dateTime ;
    dcterms:creator
        <http://renatostauffer.ch/#i> ,
        <http://csarven.ca/#i> ;
    .

So you kind of separate the static triples from the ones that are generated along the way.