Frijol closed this issue 5 years ago
I tried looking for EIS's of a controversial sewage treatment plant near my home on these various tools… [when searching our prototype,] I got results, but none of them relevant to what I actually wanted… I tried searching the relevant terms in a Google Search, and immediately found what I was looking for:
The first thing that jumped out at me here is that the EIS in question is from 2003, and the EPA database, which is the only data we have in the prototype, only goes back to 2012. So it’s a given that this was never going to be findable in our search.
I think we’d need to pick a newer topic if we wanted to test the potential utility of this kind of solution. More importantly, though, I feel like searching by name of facility (which often corresponds directly to the name of the document) isn’t a great use here — any of these would be easy to find through other means if you knew the name, and I think what the authors of that paper were getting at (both from the paper and from our meeting with them) is something more like: “can I search for salmon to see which projects affected salmon habitats and how that was documented?” (See also your note about subject-based rather than title-based search; that was the main reason to start with full-text indexing, IIRC. It gets you partway to this without any need to figure out what tags or keywords to apply to every document.)
Full-text search: did we already accomplish this in the prototype?
This is really the only major thing we accomplished in this prototype :)
Try searching for “herbicide,” for example. That’s not in any titles, but you’ll get plenty of results.
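To illustrate the kind of full-text matching being described (this is just a minimal sketch using SQLite’s FTS5 extension, not the prototype’s actual stack; the table name, column names, and document text are made up for illustration):

```python
import sqlite3

# Hypothetical miniature of full-text search over document bodies.
# Schema and data here are invented for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE VIRTUAL TABLE docs USING fts5(title, body)")
conn.executemany(
    "INSERT INTO docs (title, body) VALUES (?, ?)",
    [
        ("Final EIS: Riverbend Highway Extension",
         "Impacts on wetlands and herbicide use along the right-of-way."),
        ("Draft EIS: Coastal Wind Project",
         "Analysis of avian mortality and viewshed effects."),
    ],
)
# "herbicide" appears in no title, but full-text search over the body
# text still surfaces the relevant document.
rows = conn.execute(
    "SELECT title FROM docs WHERE docs MATCH 'herbicide'"
).fetchall()
print(rows)  # → [('Final EIS: Riverbend Highway Extension',)]
```

The point is the same as the “herbicide” example above: indexing body text gets you subject-ish search without anyone having to hand-apply tags.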
Does Google Dataset Search do this? Also can't tell.
That one is also unclear to me, but it certainly seems like it’s more oriented towards things like tabular data, which might be components of EISes, but not EISes themselves.
It's possible that at the moment, our demo accesses more [databases than Google Dataset Search]
Well, our demo only searches one (the EPA database), so this is either a totally divergent set from Google’s Dataset Search or a subset.
Direct link to downloadable PDF (again, can we do this? Or are we limited here?)
@b5 and @rgardaphe can answer this better, but IIRC there were two issues:
So, if I recalled that right, there’s no serious reason we couldn’t, except in cases where there was no actual document for us to download in the first place.
Non-standard formatting/content of documents: Wentz notes that a lack of formal standards makes it challenging to compare across documents & check accountability
I don’t think we were thinking about this point in particular, BUT if we could reasonably parse text out of it, we could always provide reformatted copies of every document in one standard HTML template. That’s a little iffy though — if you can only read 90% of the content, it’s still useful in a search index in a way that it might not be if you took that extracted text and tried to make a document from it again. You might be missing critical parts for a human who is reading it narratively.
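A rough sketch of what “reformatted copies in one standard HTML template” could look like (the template and function names here are hypothetical, and this glosses over the hard part — actually extracting clean text from the source PDFs, which is exactly where the 90%-coverage caveat bites):

```python
from html import escape

# Hypothetical single standard template for re-rendering extracted text.
# Real extraction is lossy: fine for a search index, risky for a human
# reading the document narratively, as noted above.
TEMPLATE = """<article>
  <h1>{title}</h1>
  {paragraphs}
</article>"""

def render(title, extracted_paragraphs):
    """Wrap already-extracted paragraphs of text in the standard template."""
    paragraphs = "\n  ".join(
        "<p>{}</p>".format(escape(p)) for p in extracted_paragraphs
    )
    return TEMPLATE.format(title=escape(title), paragraphs=paragraphs)

html = render("Example EIS", ["Chapter 1: Purpose & Need",
                              "Chapter 2: Alternatives"])
print(html)
```

Even with a template this simple, anything the extractor dropped (tables, figures, appendices) silently disappears from the reformatted copy, which is the concern raised above.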
Issues we might be able to work on (is this feasible/interesting for us?)… Record removal
I think the really important point here is that there has to both be very strong external interest and serious funding for these. If we are stepping up to be a reliable provider of long term storage here, that’s a big commitment, and we have to be careful that we and the data don’t suddenly disappear. I think we’d love to do that, but that means it really needs some serious, stable funding. And the argument we’ll face then is: why do we deserve the funding to do that more than, say, Hathi trust or Northwestern University Libraries, who are already sort of doing this?
are we in a position to provide something Google Dataset Search is not?
I think, as I said earlier, this actually isn’t a problem Google Dataset Search is really even attempting (I could definitely be wrong). And you certainly aren’t wrong about the other issues. As we’ve all said in meetings, to go a whole lot farther requires some serious commitment of time and energy from many people, finding both clever and just dumb brute-force solutions to things like: tagging docs, acquiring & scanning hard copies, building relationships (and ETL pipelines) to get digital copies from other libraries and agencies, etc. etc.
just noting what @Mr0grog says:
I think the really important point here is that there has to both be very strong external interest and serious funding for these. If we are stepping up to be a reliable provider of long term storage here, that’s a big commitment, and we have to be careful that we and the data don’t suddenly disappear. I think we’d love to do that, but that means it really needs some serious, stable funding. And the argument we’ll face then is: why do we deserve the funding to do that more than, say, Hathi trust or Northwestern University Libraries, who are already sort of doing this?
Probably we need to think of these as 2 quasi-separate issues:
The first of these is already difficult for us. The latter is very far away. In the medium term, we are not going to become the stable entity that provides long-term reliable access. An institutional partner would be way better suited to that task, whether it's IA or a University Library or potentially someone else. So that means there are at least 2 blocking issues:
Probably we need to have a pretty high chance of success on both of those before we invest any more time in the project :-/
👍 on that.
Probably we need to have a pretty high chance of success on both of those
Cannot agree enough here. The civic tech landscape is littered with projects that only produce useful code without any institutionalization or long-term effort, which often means they don’t have any real utility at all.
we would need money to pay developers
+ people building relationships with other orgs, people sitting and analyzing/tagging/geotagging docs, people acquiring & scanning docs, etc. I want to re-emphasize that a huge amount of the effort required to solve the problems in the paper is not software programming (it could be developers doing all these roles, but also maybe not).
enthusiastic institutional partners willing to host this tool
Personally, I don’t feel that’s as big a deal. If we are sustainable enough to support ongoing activities around the above stuff, we ought to be sustainable for hosting (it’s almost certainly the far lesser cost).
I'm not thinking of the money, but of the timescale. We've been around for 2 years. We might or might not be around for 10 more. A library user will want a tool with a longer lifespan.
I didn’t only mean $$$ when I said “cost” ;)
Anyway, I think I hear you on that but also partly disagree a bit, @titaniumbones. But I also don’t think that discussion’s going to fit well in GitHub comment boxes here. Better on a call sometime.
^^ I agree about a call to facilitate that conversation and also appreciate the thoughtful questions so far!
(Also: just wanted to note that I added a space in @titaniumbones' comment to separate the quote from the reply; it took me a couple of read-throughs.)
We could try to marshal a massive amount of volunteer labor for the paper issue, à la Data Rescue, since we have paid coordination now… it's nice that this would have a more limited scope (although even the thought of this likely triggered many people who have just recovered from Data Rescue). It would probably depend on institutional support/partnerships with Northwestern, though. In terms of tech development, I wonder how much of this could overlap with QRI.
Updating with feedback from the paper's author:
Top features:
In the long run, having a broader universe of EIA documents would be great. But that might be too ambitious for this first update to the initial prototype.
She also had an interest in finding out whether Columbia U could provide resources (labor and/or hosting). Not a commitment, but promising.
Closing as this conversation has resolved as follows:
tl;dr: key questions at bottom
Hi! I'm looking at this project & its roadmapping. As part of this process, I took a look at the original inspiration, the 2016 Wentz Paper.
Problem statement, as outlined in the Wentz paper
From the Wentz paper, I looked at the problem statement and classified outlined challenges by what I think we can and cannot do.
It has been two years since the paper was written, so these key points are also considered in light of the newly released Google Dataset Search.
Issues that we can tackle:
Issues that we are not in a position to approach:
Issues we might be able to work on (is this feasible/interesting for us?)
The elephant-in-room question is: are we in a position to provide something Google Dataset Search is not?
Having played around (slightly) with Google Dataset Search, I’m not very impressed. It seems they are running into several of the issues Wentz notes that I don’t think we’re in a position to tackle:
Edit: I had a quick user study here, but on reflection have removed it as not valid & distracting from the central conversation. It now lives here just for documentation purposes
OK, so what are the key questions?
I think this is a really cool project. But I also think that we need wholehearted "yes" answers to both of the above questions if we are to pursue it further.
Curious to hear thoughts on this.