cwrc / HuViz

LOD visualization tool for humanities datasets

make orlandoScrape support extracting authors as either subject or object (4 hrs) #37

Open smurp opened 9 years ago

smurp commented 9 years ago

Think about Shakespeare and how small the current shakwi dataset is. That is because it only contains triples where shakwi is the subject. Supporting extraction with the author as either subject or object would have a huge impact on what the test datasets would be capable of. I propose this for the "Ready for User Testing" milestone.

ilovan commented 9 years ago

Decision: Yes, we want this. This may increase the size of the datasets exponentially (see #35). If we implement it, we have to be mindful of the size of the testing datasets, since we will deal with #35 after user testing.

Question to @smurp: Once this is implemented, will we have to enrich the XPath selection to take advantage of this functionality?

smurp commented 9 years ago

We'd want this capability to be optional and hence would merely add the possibility of datasets which span the corpus more readily. We would control the data size by restricting the predicate set.

cmiya commented 9 years ago

I think it has to go both ways, otherwise it would just be repeating the same information that was in the source article. But I agree we have to be mindful of the dataset size -- if we could restrict it temporarily by capping the number of predicates (purely for test purposes), that would be helpful.

smurp commented 9 years ago

So the question is: for usability testing, should we have some datasets which rely on being able to perform extracts where a particular writer appears as the object of triples?

BTW @cmiya I don't think this ought to have any impact on the XPath rules -- which govern the triples that are even recognized -- because this is a question of which triples happen to be included in a particular dataset.

ilovan commented 9 years ago

I am a little nervous about doing this prior to dealing with #35.

@smurp, I am not sure I understand what you mean by making this feature optional: optional for whom, the users or the administrators? Also, how would "restricting the predicate set" work? Would this be applicable to specific datasets or to all datasets?

SusanBrown commented 9 years ago

I think this would be very valuable and make the datasets more interesting. If we can do it within 4 hours, that would be worth it, but if it spills past that we should pull back for testing purposes.

Ideally too (but not now) we should include any connections between any two nodes in the current dataset, in order to make any subset a more accurate representation of the overall graph. I'll make a separate GitHub ticket for that.
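For clarity, here is a minimal sketch of that idea -- not an existing HuViz or orlandoScrape feature; the function and variable names are hypothetical. Given the nodes already in a subset, it also pulls in every corpus triple whose subject and object are both in that node set.

```python
# Hypothetical sketch of the idea above, not an existing HuViz/orlandoScrape
# feature: close an extract over its own nodes, so any corpus triple linking
# two nodes already in the subset is included as well.
def close_over_nodes(all_triples, extract_triples):
    """Return the extract plus every corpus triple linking two of its nodes."""
    extract = list(extract_triples)
    nodes = {term for s, _, o in extract for term in (s, o)}
    already = set(extract)
    extra = [(s, p, o) for s, p, o in all_triples
             if s in nodes and o in nodes and (s, p, o) not in already]
    return extract + extra
```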

cmiya commented 8 years ago

@smurp I'd also like to hear more about how to control the dataset size by restricting the predicate set. Would this be decided on a case-by-case basis after testing the test datasets and seeing how they perform? I'm assuming this would just be a temporary measure that hopefully won't be necessary after we tackle some of the issues that are causing crashes, like #35.

Like @ilovan, I'm a little worried that this might be an added complexity, in terms of increasing the size and causing crashes. I'm wondering if it's something that we should definitely strive for, but only once the bug has been fixed (i.e. after the first round of testing), so it won't affect the overall performance.

smurp commented 8 years ago

By "optional" I am meaning that we would still be able to create datasets using the current logic (ie asking for shakwi only gives us triples from the shakwi ENTRY) OR we would (by adding a switch to the orlandoScrape.py command line called --spanning or somesuch) be able to optionally indicate that we would like all the triples which include shakwi either as the subject (ie from the shakwi ENTRY) or as the object (ie from other entries).

So this would have no impact other than to permit us to create extracts which are broader in this way; it would not have to affect all the extractions we perform. My purpose in suggesting it was to cope with the possibility that your user testing would be easier with the more interesting datasets made possible by this technique. We would certainly have to be very sensitive about the sizes of the extracts made this way, which we could control by constraining the predicates which are emitted -- using a facility which already exists.

@cmiya Yes, I would imagine that we would run an extract, see that the size was just too great, and then go back and constrain the predicates. Doing so requires explicit itemization of the desired predicates, sadly, for there is not yet a feature which (based on the predicate hierarchy present in the ontology) can restrict predicate extraction simply by specifying general predicates and having only their descendants appear in the extract. A feature for another day.
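As a rough illustration of constraining the extract by an explicitly itemized predicate whitelist (the names and example predicates below are hypothetical and do not reflect the actual facility in orlandoScrape.py):

```python
# Rough illustration of restricting an extract to an explicitly itemized
# predicate whitelist; names here are hypothetical, not orlandoScrape.py's API.
# Hierarchy-aware restriction (keeping only descendants of a general
# predicate, based on the ontology) is the feature that does not exist yet.
ALLOWED_PREDICATES = {'knows', 'memberOf'}  # example placeholders

def constrain_predicates(triples, allowed=ALLOWED_PREDICATES):
    """Keep only triples whose predicate is in the itemized whitelist."""
    return [(s, p, o) for s, p, o in triples if p in allowed]
```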

It shouldn't take long to implement, but runtimes will be significant, and the auditing of resultant file sizes and iterative refinement of the extracts will take time. That's a self-limiting process, though: we would use the feature as much as we can afford to.

smurp commented 8 years ago

Certainly this feature could be deferred till after testing. My purpose in suggesting it was that I thought our datasets could be more interesting and relevant to users with this facility.

ghost commented 8 years ago

I thought we had decided that we wanted to do this.


cmiya commented 8 years ago

@smurp Thanks for clearing that up! I'm glad to know it's a self-limiting procedure, so we will only take on as much as we are able. @SusanBrown I'll include this task in the scoping document (along with the updated tally of projected hours) and send it to you and Shawn asap.

cmiya commented 8 years ago

@SusanBrown @smurp Should we postpone this until after user testing? I don't think it's something that will have a huge influence on the type of feedback we get from users. I'm also thinking it might be less critical than addressing other issues, like the ability of the tool to handle large (or larger) datasets without crashing. I suggest we remove it from the testing milestone and reevaluate once we have had a team meeting.

ghost commented 8 years ago

I think the datasets would be more interesting if they came from a greater range of sources. As I understand it, right now they are essentially visualizations of the relationships within a single document. So I think if this can be done in the time estimated it would be good. If it’s going to take a lot longer then let’s defer it. (I realize I’ve reversed my position on this—partly as a result of more use of the tool.)

There are other serious problems with the data related to what Orlando Scrape is scraping but they will take some detailed work to sort out.


cmiya commented 8 years ago

In that case, @smurp, do you know when you'll have this task completed? I believe it's the last major one connected with this milestone.

smurp commented 8 years ago

Proposal: remove from this milestone

There is a choice to be made about how to proceed with this issue -- and I think the right choice is to revisit this matter once we have a quadstore.

I've already put in more than the estimated amount of time on the SQLite approach, but there is more to do. The alternatives at the moment are:

1. Extract everything to SQLite using orlandoScrape.py: add an SQLite output option to orlandoScrape.py so it can store everything in an SQL db, then write another emitter which can generate output for huviz from the SQL db (5 hrs already and at least another 5 hrs). See the sketch after this list.

2. Make orlandoScrape.py more complicated: radically transform the interior of orlandoScrape.py so it can work both inside out and backwards to perform this task AND run very slowly (10+ hrs of really pointless work). This is an investment in orlandoScrape which is tangential to most goals.

3. JavaScript version of orlandoScrape to fill an rdflibjs quadstore: write a minimalist orlandoScrape.js in JavaScript using rdflibjs and scrape everything into one of its quadstore technologies, so that .nq (or .trig?) extracts for huviz can be generated from it AND we can also query it with SPARQL (10 to 20 hrs).

4. Use an arbitrary quadstore: run an extract of EVERYTHING from orlandoScrape.py, then import it into an arbitrary quadstore and generate static huviz extracts from it (10+ hrs). Arguably that quadstore would be some part of rdflibjs, to benefit maximally from its other capabilities, but that is certainly not obvious -- other quadstores might serve broader goals better, or a different .js ecosystem (based on https://github.com/RubenVerborgh/N3.js for example) might be more strategic, performant, or powerful.
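For the first alternative, here is a minimal sketch of what the SQLite store and a separate emitter could look like -- an illustration only, assuming a simple three-column triples table; the table, column, and function names are hypothetical, not part of orlandoScrape.py.

```python
# Illustrative sketch of the SQLite alternative (hypothetical names throughout):
# orlandoScrape.py would write every scraped triple into one table, and a
# separate emitter would query that table to build per-dataset extracts.
import sqlite3

def build_store(path, triples):
    """Create a triples table and load (subject, predicate, object) rows."""
    con = sqlite3.connect(path)
    con.execute('CREATE TABLE IF NOT EXISTS triples (s TEXT, p TEXT, o TEXT)')
    con.executemany('INSERT INTO triples VALUES (?, ?, ?)', triples)
    con.commit()
    return con

def emit_spanning_extract(con, writer_id):
    """Emit every stored triple mentioning the writer as subject OR object."""
    return con.execute(
        'SELECT s, p, o FROM triples WHERE s = ? OR o = ?',
        (writer_id, writer_id)).fetchall()

if __name__ == '__main__':
    con = build_store(':memory:', [('shakwi', 'knows', 'jonsbe'),
                                   ('jonsbe', 'admires', 'shakwi')])
    print(emit_spanning_extract(con, 'shakwi'))
```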

SusanBrown commented 8 years ago

So is it right to conclude from this that once we have a quadstore this will be much more straightforward, less time-consuming, and less prone to having to be redone? If yes, then defer.


smurp commented 8 years ago

I believe we are concluding that this goal should be addressed after a store is established.

antimony27 commented 6 years ago

@smurp @SusanBrown Can you tell me if this is something that still has to happen or if it should be closed?

SusanBrown commented 6 years ago

I think it’s moot. Better to put energies towards reading from TS. Shawn?
