reni99 opened this issue 9 years ago
Yes, I still think it is a good idea to get as much out of tarql as possible. Bash acts as "glue"; it shouldn't be the primary way to do data mapping. People invest a lot of time building mapping tools for good reasons, and it is not necessarily desirable to do imperative mapping when it can be done declaratively.
As for your coding questions: you should revisit how you produce the triples for each line. For example, you don't need to produce the types for code:economy on every line. In fact, you should refactor your code so that the recurring triples are stated just once in the CONSTRUCT.
Alternatively, you can make multiple passes over the same CSV file to generate different outputs. Given the data size, this is a very low-cost way of doing it, and of maintaining it. Trust me, you don't want to be in a situation where you have to do hard debugging on your own code. Fixing some "visible" code declaratively in tarql is a lot easier than messing around in Bash.
I totally agree that we should reuse existing, well-tested tools whenever possible. And I also agree that I don't want to be in a situation where I have to do hard debugging on my own code. But this task is not for tarql, since the tool is made for mapping and not for the creation of DSD files, IMO.
Let's look at some output tarql would have to create:
```turtle
#
# Ease of Doing Business dataset
#
dataset:ease-of-doing-business
    a qb:DataSet ;
    qb:structure structure:ease-of-doing-business ;
    dcterms:title "Ease of Doing Business"@en ;
    dcterms:issued "2014-12-04T00:00:00Z"^^xsd:dateTime ;
    dcterms:modified "2014-12-04T00:00:00Z"^^xsd:dateTime ;
    dcterms:creator
        <http://renatostauffer.ch/#i> ,
        <http://csarven.ca/#i> ;
    dcterms:license <http://creativecommons.org/publicdomain/zero/1.0/> ;
    .
```
tarql cannot take anything from the CSV files to produce the above output, which is why this is not a mapping task, IMO. If we go on with the example:
```turtle
structure:ease-of-doing-business
    a qb:DataStructureDefinition ;
    qb:component component-measure:rank ;
    qb:component component-dimension:economy ;
    qb:component component-dimension:refPeriod ;
    qb:component component-dimension-indicator:ease-of-doing-business ;
    qb:component component:overall-dtf ;
    .
```
This is probably the best place to use tarql. But again, every DataStructureDefinition of each indicator is different. This means writing some kind of SPARQL query like the following for each indicator:
```sparql
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
PREFIX dcterms: <http://purl.org/dc/terms/>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX qb: <http://purl.org/linked-data/cube#>
PREFIX sdmx: <http://purl.org/linked-data/sdmx#>
PREFIX sdmx-attribute: <http://purl.org/linked-data/sdmx/2009/attribute#>
PREFIX sdmx-code: <http://purl.org/linked-data/sdmx/2009/code#>
PREFIX sdmx-concept: <http://purl.org/linked-data/sdmx/2009/concept#>
PREFIX sdmx-dimension: <http://purl.org/linked-data/sdmx/2009/dimension#>
PREFIX sdmx-measure: <http://purl.org/linked-data/sdmx/2009/measure#>
PREFIX sdmx-metadata: <http://purl.org/linked-data/sdmx/2009/metadata#>
PREFIX doingbusiness: <http://doingbusiness.270a.info/>
PREFIX measure: <http://doingbusiness.270a.info/measure/>
PREFIX doingbusiness-dataset: <http://doingbusiness.270a.info/dataset/>
PREFIX doingbusiness-structure: <http://doingbusiness.270a.info/structure/>
PREFIX dataset: <http://doingbusiness.270a.info/dataset/>
PREFIX structure: <http://doingbusiness.270a.info/structure/>
PREFIX component: <http://doingbusiness.270a.info/component/>
PREFIX dimension: <http://doingbusiness.270a.info/dimension/>
PREFIX concept: <http://doingbusiness.270a.info/concept/>
PREFIX concept-indicator: <http://doingbusiness.270a.info/concept/indicator/>
PREFIX code: <http://doingbusiness.270a.info/code/>
PREFIX code-indicator: <http://doingbusiness.270a.info/code/indicator/>
PREFIX economy: <http://doingbusiness.270a.info/code/economy/>
PREFIX component-dimension: <http://doingbusiness.270a.info/component/dimension/>
PREFIX component-measure: <http://doingbusiness.270a.info/component/measure/>
PREFIX component-attribute: <http://doingbusiness.270a.info/component/attribute/>
PREFIX component-dimension-indicator: <http://doingbusiness.270a.info/component/dimension/indicator/>

CONSTRUCT {
  structure:ease-of-doing-business    # The name ease-of-doing-business is not in the csv
    a qb:DataStructureDefinition ;
    qb:component component-measure:?rank ;
    qb:component component-dimension:?economy ;
    qb:component component-dimension:?refPeriod ;
    qb:component component-dimension-indicator:?ease-of-doing-business ;    # This is not in the csv
    qb:component component:?overall-dtf .
}
FROM <../data/starting-a-business.2004.transformable.csv>
WHERE {
}
LIMIT 1
```
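As an aside, a prefixed name combined with a variable (`component-measure:?rank`) is not legal SPARQL syntax; in tarql, URIs built from column values are normally constructed with BIND and string functions. A purely illustrative sketch, assuming a hypothetical ?indicator column (the real CSV headers may differ, which is exactly the problem):

```sparql
PREFIX qb: <http://purl.org/linked-data/cube#>

CONSTRUCT {
  ?structure a qb:DataStructureDefinition ;
    qb:component ?component .
}
FROM <../data/starting-a-business.2004.transformable.csv>
WHERE {
  # ?indicator is an assumed CSV column name, not taken from the actual file.
  BIND (URI(CONCAT('http://doingbusiness.270a.info/structure/', ?indicator)) AS ?structure)
  BIND (URI(CONCAT('http://doingbusiness.270a.info/component/measure/', ?indicator)) AS ?component)
}
```

Even so, names like ease-of-doing-business simply aren't in the CSV, so this does not remove the need for per-indicator knowledge.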
So would you suggest writing queries like the above for all the indicators?
The next few blocks of the DSD file are in the same category as the first: there is nothing in the CSV that can create these blocks (i.e., no mapping):
```turtle
code-indicator:ease-of-doing-business
    a skos:Concept ;
    skos:inScheme code:indicator ;
    skos:topConceptOf code:indicator ;
    skos:prefLabel "Ease of Doing Business"@en ;
    .

code:indicator
    a skos:ConceptScheme ;
    skos:prefLabel "Indicator"@en ;
    skos:hasTopConcept
        code-indicator:ease-of-doing-business ,
        code-indicator:dealing-with-construction-permits ,
        code-indicator:enforcing-contracts ,
        code-indicator:getting-credit ,
        code-indicator:getting-electricity ,
        code-indicator:paying-taxes ,
        code-indicator:protecting-minority-investors ,
        code-indicator:registering-property ,
        code-indicator:resolving-insolvency ,
        code-indicator:starting-a-business ,
        code-indicator:trading-across-borders ;
    .

component:indicator-ease-of-doing-business
    a qb:ComponentSpecification ;
    qb:dimension dimension-indicator:ease-of-doing-business ;
    .

component:indicator-dealing-with-construction-permits
    a qb:ComponentSpecification ;
    qb:dimension dimension-indicator:dealing-with-construction-permits ;
    .

component:indicator-enforcing-contracts
    a qb:ComponentSpecification ;
    qb:dimension dimension-indicator:enforcing-contracts ;
    .

component:indicator-getting-credit
    a qb:ComponentSpecification ;
    qb:dimension dimension-indicator:getting-credit ;
    .

component:indicator-getting-electricity
    a qb:ComponentSpecification ;
    qb:dimension dimension-indicator:getting-electricity ;
    .

component:indicator-paying-taxes
    a qb:ComponentSpecification ;
    qb:dimension dimension-indicator:paying-taxes ;
    .

component:indicator-protecting-minority-investors
    a qb:ComponentSpecification ;
    qb:dimension dimension-indicator:protecting-minority-investors ;
    .

component:indicator-registering-property
    a qb:ComponentSpecification ;
    qb:dimension dimension-indicator:registering-property ;
    .

component:indicator-resolving-insolvency
    a qb:ComponentSpecification ;
    qb:dimension dimension-indicator:resolving-insolvency ;
    .

component:indicator-starting-a-business
    a qb:ComponentSpecification ;
    qb:dimension dimension-indicator:starting-a-business ;
    .

component:indicator-trading-across-borders
    a qb:ComponentSpecification ;
    qb:dimension dimension-indicator:trading-across-borders ;
    .
```
Now, as I said in the previous comment, the country codes are pretty much the only place where I see a useful application of tarql. There is actual mapping here: data from the CSV file gets mapped:
```turtle
code:economy
    a sdmx:CodeList , skos:ConceptScheme ;
    skos:hasTopConcept
        economy:AF ,
        economy:AL ,
        economy:DZ ,
        economy:AO ,
        economy:AG ,
        economy:AR ,
        economy:AM ,
        economy:AU ,
        economy:AT ,
        ...
```
I will try to accomplish that with tarql.
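A sketch of what that tarql query might look like, assuming a hypothetical ?code column holding the economy codes (the real header may differ):

```sparql
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
PREFIX sdmx: <http://purl.org/linked-data/sdmx#>
PREFIX code: <http://doingbusiness.270a.info/code/>

CONSTRUCT {
  code:economy a sdmx:CodeList , skos:ConceptScheme ;
    skos:hasTopConcept ?economy .
  ?economy a skos:Concept ;
    skos:inScheme code:economy .
}
FROM <../data/starting-a-business.2004.transformable.csv>
WHERE {
  # ?code is an assumed CSV column name.
  BIND (URI(CONCAT('http://doingbusiness.270a.info/code/economy/', ?code)) AS ?economy)
}
```

The CONSTRUCT template is instantiated once per row, so the code:economy triples get emitted repeatedly; they collapse into one when the output is loaded as a graph.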
For the components there is the following problem: there are indicators that share the same components. If I did the mapping with tarql, the final DSD would end up containing multiple identical triples (which is probably unwanted). For example, the component rank is in every CSV file, so you would end up with something like 11 rank components in the DSD when doing the tarql mapping.
The following Bash code handles this problem with sorting in just one line (the first line):
```shell
# Deduplicate and sort the collected indicators (the "one line").
sortedUniqueIndicators+=($(echo "${indicatorsToLoop[@]}" | tr ' ' '\n' | sort -u | tr '\n' ' '))
sortedUniqueIndicatorsLength=${#sortedUniqueIndicators[@]}

# Emit each component/measure pair exactly once.
for ((i=0; i<sortedUniqueIndicatorsLength; i++))
do
    echo "component:${sortedUniqueIndicators[$i]}
    a qb:ComponentSpecification ;
    qb:measure measure:${sortedUniqueIndicators[$i]} ;
    ." >> meta.ttl
    printf "\n" >> meta.ttl
    echo "measure:${sortedUniqueIndicators[$i]}
    a qb:MeasureProperty ;
    ." >> meta.ttl
    printf "\n" >> meta.ttl
done
```
This eventually outputs every component just once. I don't know how I could accomplish that with just tarql...
```turtle
component:rank
    a qb:ComponentSpecification ;
    qb:measure measure:rank ;
    .

measure:rank
    a qb:MeasureProperty ;
    .

component:dtf
    a qb:ComponentSpecification ;
    qb:measure measure:dtf ;
    ...
```
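An alternative, if the per-indicator tarql runs were serialized as N-Triples (one triple per line), would be to drop duplicates across files with plain `sort -u` instead of tracking state inside the script. A minimal sketch with made-up file contents (tarql itself is not invoked here):

```shell
#!/bin/sh
# Two hypothetical per-indicator outputs; both contain the 'rank' component triple.
printf '%s\n' \
  '<http://example.org/component/rank> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://purl.org/linked-data/cube#ComponentSpecification> .' \
  '<http://example.org/component/dtf> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://purl.org/linked-data/cube#ComponentSpecification> .' > a.nt
printf '%s\n' \
  '<http://example.org/component/rank> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://purl.org/linked-data/cube#ComponentSpecification> .' > b.nt

# Concatenate all per-indicator outputs and remove duplicate triples.
sort -u a.nt b.nt > components.nt   # leaves 2 unique triples
```

Note that this only works because N-Triples is line-based; Turtle output cannot safely be deduplicated line by line.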
I hope this makes it somewhat clearer why I still tend to use Bash code in these situations :D
I didn't mean that everything must be done within tarql. Some things can be done from Bash, e.g., adding a triple with a timestamp. At the same time, some triples can simply live in static files. Basically, you have to balance out where it makes sense to do things declaratively (including tarql and static files) and where imperatively.
In response to the code snippets:
To summarize: create simple enough tarql scripts to work with. Don't worry about how many there are. Don't worry about minor redundancies. Use Bash to "manage" the whole thing, so that different processes can work together, and the data can move around easily. That's the point. There are times where you can let a Bash script keep some of the data, but that's just a matter of whether that particular data (which is stored in the script) will be used again by something else or not. If it will be, move it to a static file.
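As a trivial, hypothetical sketch of that split (file names made up; in the real pipeline the generated part would come from tarql or the Bash script):

```shell
#!/bin/sh
# Static triples that never change are kept in a checked-in file...
cat > static.ttl <<'EOF'
dataset:ease-of-doing-business
    dcterms:license <http://creativecommons.org/publicdomain/zero/1.0/> .
EOF

# ...while per-run triples, e.g. a timestamp, are generated at build time.
cat > generated.ttl <<EOF
dataset:ease-of-doing-business
    dcterms:issued "$(date -u +%Y-%m-%dT%H:%M:%SZ)"^^xsd:dateTime .
EOF

# Bash only glues the pieces together.
cat static.ttl generated.ttl > meta.ttl
```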
Let me know if this is clear or sufficient. Otherwise, show me in person next week.
Ok, I will look for a way to do that.
I have another question: is it okay to have something like the following, or is this not really wanted?
```turtle
#
# Ease of Doing Business dataset
#
dataset:ease-of-doing-business
    a qb:DataSet ;
    qb:structure structure:ease-of-doing-business ;
    dcterms:title "Ease of Doing Business"@en ;
    dcterms:license <http://creativecommons.org/publicdomain/zero/1.0/> ;
    .

dataset:ease-of-doing-business
    dcterms:issued "2014-12-04T00:00:00Z"^^xsd:dateTime ;
    dcterms:creator
        <http://renatostauffer.ch/#i> ,
        <http://csarven.ca/#i> ;
    .
```
This way you kind of separate the static triples from the ones generated along the way.
Hey Sarven, I have a question concerning the use of tarql to create the DSD file. You suggested maximizing the use of tarql for this.
I think that adds unnecessary complexity to the code. You have to access the CSV files anyway; if you do it with Bash only, you are more flexible, IMO.
Simple example:
I mean, this adds unnecessary files and the code won't get shorter. Even worse: what do you do when it comes to the components? You don't want to repeat components that are already in the file. So again, pure Bash offers more flexibility here than tarql.
In terms of the code:economy triples I see some use for tarql, but even there it is problematic: it seems that tarql is line-based and gives me the following results:
... and so on, which is certainly not desirable.
Do you really think it is a good idea to use tarql for this task? I think it ends up like this:
My suggestion is to stick to pure bash and optimize this code.
I hope you understand what I want to say (it is a bit hard to explain).