ceurws / ToolsEswc

Tools and extensions for SemPub2015, co-located with the Extended Semantic Web Conference 2015
GNU General Public License v3.0

not processed volumes #15

Open liyakun opened 8 years ago

liyakun commented 8 years ago

Some volumes that were not processed:

http://ceur-ws.org/Vol-41/

http://ceur-ws.org/Vol-1549/

S6savahd commented 8 years ago

Good that we are listing these. Please try to add them to the dataset manually, so that we have a complete dataset as of today.

clange commented 8 years ago

Vol-1549 is interesting; @S6savahd, would you be able to talk to Sarven about this one? I wonder why it's not working, because I thought it is technically the same as Vol-1550 and Vol-1551.

The source code of these volumes has changed completely, and so has the way the layout is computed, so the developers of the information extraction tool had no chance to adapt their tool to it; still, these new volumes look the same as the old volumes, so extraction should work. These volumes are important because this format will soon be the new standard. Many of them (those created with ceur-make and Rohan's soon-to-be-finished web UI frontend) will have RDFa and thus won't require sophisticated information extraction, but others (those created manually) won't have RDFa. For the latter, an adaptation of one of the other information extraction tools might work better; in any case, all of these new volumes will have a very clean, uniform structure.

liyakun commented 8 years ago

@clange you are right: after filtering out non-relevant layout-related information, Vol-1549, Vol-1550, and Vol-1551 are left with no information at all. Sorry that the list was not complete; below is some of the data before filtering:

<http://ceur-ws.org/Vol-1549/> <http://fitlayout.github.io/ontology/segmentation.owl#country> <http://dbpedia.org/resource/Australia> ;
    <http://fitlayout.github.io/ontology/segmentation.owl#icoloc> "" ;
    <http://fitlayout.github.io/ontology/segmentation.owl#idateplace> "Proceedings of the 1st International Workshop on Semantic Statistics co-located with 12th International Semantic Web Conference (ISWC 2013), Sydney, Australia, October 11th, 2013" ;
    <http://fitlayout.github.io/ontology/segmentation.owl#ienddate> "2013-10-11" ;
    <http://fitlayout.github.io/ontology/segmentation.owl#iproceedings> "Proceedings of the 1st International Workshop on Semantic Statistics co-located with 12th International Semantic Web Conference (ISWC 2013), Sydney, Australia, October 11th, 2013" ;
    <http://fitlayout.github.io/ontology/segmentation.owl#istartdate> "2013-10-11" ;
    <http://fitlayout.github.io/ontology/segmentation.owl#isubmitted> "2016-03-15" ;
    <http://fitlayout.github.io/ontology/segmentation.owl#ititle> "Semantic Statistics 2013" .

<http://ceur-ws.org/Vol-1550/> <http://fitlayout.github.io/ontology/segmentation.owl#related> <http://ceur-ws.org/Vol-1549/> ;
    <http://fitlayout.github.io/ontology/segmentation.owl#country> <http://dbpedia.org/resource/Italy> ;
    <http://fitlayout.github.io/ontology/segmentation.owl#icoloc> "" ;
    <http://fitlayout.github.io/ontology/segmentation.owl#idateplace> "Proceedings of the 2nd International Workshop on Semantic Statistics co-located with 13th International Semantic Web Conference (ISWC 2014), Riva del Garda, Italy, October 19th, 2014" ;
    <http://fitlayout.github.io/ontology/segmentation.owl#ienddate> "2014-10-19" ;
    <http://fitlayout.github.io/ontology/segmentation.owl#iproceedings> "Proceedings of the 2nd International Workshop on Semantic Statistics co-located with 13th International Semantic Web Conference (ISWC 2014), Riva del Garda, Italy, October 19th, 2014" ;
    <http://fitlayout.github.io/ontology/segmentation.owl#istartdate> "2014-10-19" ;
    <http://fitlayout.github.io/ontology/segmentation.owl#isubmitted> "2016-04-23" ;
    <http://fitlayout.github.io/ontology/segmentation.owl#ititle> "Semantic Statistics 2014" .

<http://ceur-ws.org/Vol-1551/> <http://fitlayout.github.io/ontology/segmentation.owl#related> <http://ceur-ws.org/Vol-1549/> , <http://ceur-ws.org/Vol-1550/> ;
    <http://fitlayout.github.io/ontology/segmentation.owl#icoloc> "" ;
    <http://fitlayout.github.io/ontology/segmentation.owl#idateplace> "Proceedings of the 3rd International Workshop on Semantic Statistics, co-located with 14th International Semantic Web Conference (ISWC 2015), Bethlehem, U.S., October 11th, 2015" ;
    <http://fitlayout.github.io/ontology/segmentation.owl#ienddate> "2015-10-11" ;
    <http://fitlayout.github.io/ontology/segmentation.owl#iproceedings> "Proceedings of the 3rd International Workshop on Semantic Statistics, co-located with 14th International Semantic Web Conference (ISWC 2015), Bethlehem, U.S., October 11th, 2015" ;
    <http://fitlayout.github.io/ontology/segmentation.owl#istartdate> "2015-10-11" ;
    <http://fitlayout.github.io/ontology/segmentation.owl#isubmitted> "2016-03-15" ;
    <http://fitlayout.github.io/ontology/segmentation.owl#ititle> "Semantic Statistics 2015" .

The original tool will still extract some "related volume" information for Vol-1550 and Vol-1551, as that information comes from index.html. If we remove the layout information for these three volumes as we do for the other volumes, then no information will be left. I have tried to change the original tool before, without success; I will spend some more time on fixing it. Another option is that I write a separate script for these three volumes, which should also work for other volumes structured like these three, provided they are well structured.
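Since the "related volume" links come straight from the index.html markup rather than from layout analysis, a plain HTML scan can recover them. A minimal sketch of that idea (the HTML fragment and helper name are illustrative assumptions, not the actual CEUR markup or tool code):

```python
import re

# Hypothetical fragment of a CEUR index.html; the real markup may differ.
SAMPLE_INDEX = """
<p>Related: <a href="http://ceur-ws.org/Vol-1549/">Vol-1549</a>,
<a href="http://ceur-ws.org/Vol-1550/">Vol-1550</a></p>
"""

def related_volumes(html):
    """Return CEUR volume URLs linked from an index page, in document order."""
    return re.findall(r'href="(http://ceur-ws\.org/Vol-\d+/)"', html)

print(related_volumes(SAMPLE_INDEX))
# → ['http://ceur-ws.org/Vol-1549/', 'http://ceur-ws.org/Vol-1550/']
```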

liyakun commented 8 years ago

Some information from the index pages of Vol-1549, Vol-1550, and Vol-1551 has been added to the new dataset, as well as all the information from Vol-41.

liyakun commented 8 years ago

@S6savahd @clange I wrote a tool, ceurws.py, to process these three volumes; it should also work with volumes that share the same structure. The output is 1549-1551.ttl, and the tool can be extended to process other volumes as well.
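The actual logic lives in ceurws.py in the repository; purely as an illustration of the approach, here is a self-contained sketch that pulls the title and date/place from a volume header and serializes them as Turtle against the segmentation.owl properties shown above (the HTML fragment and helper names are assumptions, not the real page markup or tool interface):

```python
import re

SEG = "http://fitlayout.github.io/ontology/segmentation.owl#"

# Hypothetical excerpt of a new-style CEUR volume page; the real markup may differ.
SAMPLE = """
<h1>Semantic Statistics 2013</h1>
<h3>Sydney, Australia, October 11th, 2013</h3>
"""

def to_turtle(vol_url, html):
    """Emit Turtle triples for one volume from its index page HTML."""
    title = re.search(r"<h1>(.*?)</h1>", html, re.S).group(1).strip()
    place = re.search(r"<h3>(.*?)</h3>", html, re.S).group(1).strip()
    return (f'<{vol_url}> <{SEG}ititle> "{title}" ;\n'
            f'    <{SEG}idateplace> "{place}" .')

print(to_turtle("http://ceur-ws.org/Vol-1549/", SAMPLE))
```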

S6savahd commented 8 years ago

great!

I haven't looked at the code yet, but is there a way to embed it in the main code, so that in the future we don't run the tools separately but all at once?

liyakun commented 8 years ago

@S6savahd It is possible to embed it into the post-processing script we already have, but I need to check how to embed it into the original tool, as they are written in different languages. We can also extend this tool to handle different structures in the future: since the original tool uses one common strategy for all volumes, it will not always be able to extract every volume's information completely.
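Because the two tools are in different languages, one low-effort way to run everything in a single pass is for the post-processing script to shell out to ceurws.py and merge the resulting Turtle. A hedged sketch (the flags and file names are assumptions, not the tool's actual interface):

```python
import subprocess

def extra_extraction_cmd(volumes, output="extra.ttl"):
    """Build the command line for the fallback extractor (hypothetical interface)."""
    return ["python", "ceurws.py", "--output", output] + list(volumes)

cmd = extra_extraction_cmd([
    "http://ceur-ws.org/Vol-1549/",
    "http://ceur-ws.org/Vol-1550/",
    "http://ceur-ws.org/Vol-1551/",
])
# The pipeline step would then run it and merge the output, e.g.:
# subprocess.run(cmd, check=True)
print(cmd)
```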