Assessment & Scoping of this Project

Frijol commented 5 years ago

tl;dr: key questions at bottom

Hi! I'm looking at this project & its roadmapping. As part of this process, I took a look at the original inspiration, the 2016 Wentz Paper.

Problem statement, as outlined in the Wentz paper

From the Wentz paper, I looked at the problem statement and classified outlined challenges by what I think we can and cannot do.

It has been two years since the paper was written, so these key points are also contextualized vs. the newly released Google Dataset Search.

Issues that we can tackle:

Full-text search: did we already accomplish this in the prototype? I can't tell from the readme or demo. Does Google Dataset Search do this? Also can't tell.
Broad access to EISs across several databases – Google Dataset Search theoretically solves this to a large degree. Unfortunately, it’s completely unclear which databases. It's possible that at the moment, our demo accesses more of them (see example search below)– but possibly this is something we can work with them to include.
Subject-based (rather than title-based) search: “The easiest way to find the documents is to use an external search engine such as Google, but this only works if the person conducting the search is looking for a particular document and knows the name of the document or the project. It would not work for a person who wants to find multiple documents that cover a particular subject or issue (e.g., recent EISs that involve coal mining).” – I think Google Dataset search resolves this to a large extent
Direct link to downloadable PDF (again, can we do this? Or are we limited here?)

Issues that we are not in a position to approach:

Non-inclusion of hard copies: tens of thousands of EIS & related documents exist only in hard-copy form.
Non-standard formatting/content of documents: Wentz notes that a lack of formal standards makes it challenging to compare across documents & check accountability

Issues we might be able to work on (is this feasible/interesting for us?)

Record removal: “Many agencies also remove these documents from their servers after a short period of time, sometimes as little as a year after posting.”
Records limited to EIS: “Only a small subset of the total universe of EIA documents can be obtained through [existing] databases – they do not provide access to federal EAs, RODs, FONSIs or supplemental studies, nor do they provide access to most state and local EIA documents. Some older EISs are also excluded from these databases. … Given that the vast majority of environmental reviews result in an EA rather than an EIS, the lack of access to these documents is highly problematic.”

The elephant-in-room question is: are we in a position to provide something Google Dataset Search is not?

Having played around (slightly) with Google Dataset Search, I’m not very impressed. It seems they are running into several of the issues Wentz notes that I don’t think we’re in a position to tackle:

"A search tool like this one is only as good as the metadata that data publishers are willing to provide. We hope to see many of you use the open standards to describe your data, enabling our users to find the data that they are looking for." –their intro blog post

Edit: I had a quick user study here, but on reflection have removed it as not valid & distracting from the central conversation. It now lives here just for documentation purposes

OK, so what are the key questions?

We can solve some but not all of the problems outlined in the paper. Given that, are the problems that we can solve useful to solve in isolation from the problems we can't solve? Perhaps this is a discussion that has been had already with the paper's author?
Of the problems we can solve, are we in a position to solve them better than Google Dataset Search, Google [web] Search or other similar services?

I think this is a really cool project. But I also think that we need wholehearted "yes" answers to both of the above questions if we are to pursue it further.

Curious to hear thoughts on this.

Mr0grog commented 5 years ago

I tried looking for EIS's of a controversial sewage treatment plant near my home on these various tools… [when searching our prototype,] I got results, but none of them relevant to what I actually wanted… I tried searching the relevant terms in a Google Search, and immediately found what I was looking for:

The first thing that jumped out at me here is that the EIS in question is from 2003, and the EPA database, which is the only data we have in the prototype, only goes back to 2012. So it’s a given that this was never going to be findable in our search.

I think we’d need to pick a newer topic if we wanted to test the potential utility of this kind of solution. More importantly, though, I feel like searching by name of facility (which often corresponds directly to the name of the document) isn’t a great use here; any of these would be easy to find through other means if you knew the name, and I think what the authors of that paper were getting at (both from the paper and from our meeting with them) is something more like: “can I search for salmon to see projects that affected salmon habitats and how that was documented?” (See also your note about subject-based rather than title-based search; that was the main reason to start with full-text indexing, IIRC. It gets you partway to this without any need to figure out what tags or keywords to apply to every document.)

Full-text search: did we already accomplish this in the prototype?

This is really the only major thing we accomplished in this prototype :)

Try searching for “herbicide,” for example. That’s not in any titles, but you’ll get plenty of results.
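For anyone following along who is unsure about the distinction, here is a minimal sketch of title-only versus full-text matching. It is not the prototype's actual implementation: the records in `DOCS` are invented for illustration, and `tokenize`, `build_index`, and `search` are hypothetical helper names.

```python
import re
from collections import defaultdict

# Hypothetical records, invented for illustration (not real EPA data).
DOCS = [
    {"id": 1, "title": "Highway 12 Corridor Improvement",
     "text": "Roadside vegetation will be managed with herbicide application."},
    {"id": 2, "title": "Elk Creek Dam Removal",
     "text": "Removal restores salmon spawning habitat upstream."},
    {"id": 3, "title": "Pine Ridge Forest Plan",
     "text": "Invasive species control includes targeted herbicide use."},
]


def tokenize(s):
    """Lowercase a string and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", s.lower())


def build_index(docs, field):
    """Build an inverted index (token -> set of doc ids) over one field."""
    index = defaultdict(set)
    for doc in docs:
        for token in tokenize(doc[field]):
            index[token].add(doc["id"])
    return index


def search(index, query):
    """Return sorted ids of documents containing every query token."""
    tokens = tokenize(query)
    if not tokens:
        return []
    matches = set.intersection(*(index.get(t, set()) for t in tokens))
    return sorted(matches)


title_index = build_index(DOCS, "title")
text_index = build_index(DOCS, "text")

title_hits = search(title_index, "herbicide")  # no title mentions the term
text_hits = search(text_index, "herbicide")    # the body text of docs 1 and 3 does
print(title_hits, text_hits)
```

A real deployment would more likely lean on an existing search engine than a hand-rolled index, but the effect is the same: terms that appear only in the body still surface the document.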

Does Google Dataset Search do this? Also can't tell.

That one is also unclear to me, but it certainly seems like it’s more oriented towards things like tabular data, which might be components of EISes, but not EISes themselves.

It's possible that at the moment, our demo accesses more [databases than Google Dataset Search]

Well, our demo only searches one (the EPA database), so this is either a totally divergent set from Google’s Dataset Search or a subset.

Direct link to downloadable PDF (again, can we do this? Or are we limited here?)

@b5 and @rgardaphe can answer this better, but IIRC there were two issues:

The epa.gov database doesn’t always have a document, and sometimes it has many (and not just PDFs) for a given assessment listing. We punted a little on knowing what to do in that case.
There was some concern about where to best store a big pile of these docs, so we punted on storing and providing access to them at all.

So, if I recalled that right, there’s no serious reason we couldn’t, except in cases where there was no actual document for us to download in the first place.

Non-standard formatting/content of documents: Wentz notes that a lack of formal standards makes it challenging to compare across documents & check accountability

I don’t think we were thinking about this point in particular, BUT if we could reasonably parse text out of it, we could always provide reformatted copies of every document in one standard HTML template. That’s a little iffy though — if you can only read 90% of the content, it’s still useful in a search index in a way that it might not be if you took that extracted text and tried to make a document from it again. You might be missing critical parts for a human who is reading it narratively.

Issues we might be able to work on (is this feasible/interesting for us?)… Record removal

I think the really important point here is that there has to both be very strong external interest and serious funding for these. If we are stepping up to be a reliable provider of long term storage here, that’s a big commitment, and we have to be careful that we and the data don’t suddenly disappear. I think we’d love to do that, but that means it really needs some serious, stable funding. And the argument we’ll face then is: why do we deserve the funding to do that more than, say, Hathi trust or Northwestern University Libraries, who are already sort of doing this?

are we in a position to provide something Google Dataset Search is not?

I think, as I said earlier, this actually isn’t a problem Google Dataset Search is really even attempting (I could definitely be wrong). And you certainly aren’t wrong about the other issues. As we’ve all said in meetings, to go a whole lot farther requires some serious commitment of time and energy from many people, finding both clever and just dumb brute-force solutions to things like: tagging docs, acquiring & scanning hard copies, building relationships (and ETL pipelines) to get digital copies from other libraries and agencies, etc. etc.

titaniumbones commented 5 years ago

just noting what @Mr0grog says:

I think the really important point here is that there has to both be very strong external interest and serious funding for these. If we are stepping up to be a reliable provider of long term storage here, that’s a big commitment, and we have to be careful that we and the data don’t suddenly disappear. I think we’d love to do that, but that means it really needs some serious, stable funding. And the argument we’ll face then is: why do we deserve the funding to do that more than, say, Hathi trust or Northwestern University Libraries, who are already sort of doing this?

Probably we need to think of these as 2 quasi-separate issues:

to develop and deploy a useful version of this thing, we would need money to pay developers
to serve these documents in a reliable, long-term way, we need institutional stability

The first of these is already difficult for us. The latter is very far away. In the medium term, we are not going to become the metastable entity that provides long-term reliable access. An institutional partner would be way better suited to that task, whether it's IA or a University Library or potentially someone else. So that means there are at least 2 blocking issues:

determine if there's support from users and funders for this project. If so, we can at least build it
determine whether we have enthusiastic institutional partners willing to host this tool in the long run and maintain it when it starts to suffer from bitrot.

Probably we need to have a pretty high chance of success on both of those before we invest any more time in the project :-/

Mr0grog commented 5 years ago

👍 on that.

Probably we need to have a pretty high chance of success on both of those

Cannot agree enough here. The civic tech landscape is littered with projects that only achieve useful code without any institutionalization or long-term effort, which often means they don’t have any real utility at all.

we would need money to pay developers

+ people building relationships with other orgs, people sitting and analyzing/tagging/geotagging docs, people acquiring & scanning docs, etc. I want to re-emphasize that a huge amount of the effort required to solve the problems in the paper is not software programming (it could be developers doing all these roles, but also maybe not).

enthusiastic institutional partners willing to host this tool

Personally, I don’t feel that’s as big a deal. If we are sustainable enough to support ongoing activities around the above stuff, we ought to be sustainable for hosting (it’s almost certainly the far lesser cost).

titaniumbones commented 5 years ago

On Oct 7, 2018 1:09 PM, Rob Brackett wrote:

enthusiastic institutional partners willing to host this tool

Personally, I don’t feel that’s as big a deal. If we are sustainable enough to support ongoing activities around the above stuff, we ought to be sustainable for hosting (it’s almost certainly the far lesser cost).

I'm not thinking of the money, but of the time scale. We've been around for 2 years. We might or might not be around for 10 more, maybe. A library user will want a tool with a longer lifespan.

Mr0grog commented 5 years ago

I didn’t only mean $$$ when I said “cost” ;)

Anyway, I think I hear you on that but also partly disagree a bit, @titaniumbones. But I also don’t think that discussion’s going to fit well in GitHub comment boxes here. Better on a call sometime.

dcwalk commented 5 years ago

^^ I agree about a call to facilitate that conversation and also appreciate the thoughtful questions so far!

(also: just wanted to note that I added a space in @titaniumbones' comment to separate the quote from the reply; it took me a couple of read-throughs)

shapironick commented 5 years ago

We could try to marshal a massive amount of volunteer labor for the paper issue à la Data Rescue, as we have paid coordination now... it's nice that this would have a more limited scope (although even the thought of this likely triggered many people who have just recovered from Data Rescue). It would probably depend on institutional support/partnerships w/ Northwestern, though. In terms of tech development, I wonder how much of this could overlap w/ QRI.

Frijol commented 5 years ago

Updating with feedback from the paper's author:

Top features:

[ ] ability to see subsequent pages of results (https://github.com/edgi-govdata-archiving/eis-search/issues/6)
[ ] ability to filter by different metadata (e.g., agency, publication year, etc.)
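As a rough sketch of what these two features could look like, assuming search results are plain records with `agency` and `year` fields: those field names, the sample data, and the functions `filter_results` and `paginate` are all hypothetical, and the prototype's actual schema may differ.

```python
def filter_results(results, agency=None, year=None):
    """Keep results matching every metadata filter that was supplied."""
    return [
        r for r in results
        if (agency is None or r["agency"] == agency)
        and (year is None or r["year"] == year)
    ]


def paginate(results, page=1, per_page=10):
    """Return the slice of results for a 1-indexed page number."""
    if page < 1 or per_page < 1:
        raise ValueError("page and per_page must be >= 1")
    start = (page - 1) * per_page
    return results[start:start + per_page]


# Hypothetical sample data: 25 made-up records, not real EIS listings.
RESULTS = [
    {
        "title": f"Document {i}",
        "agency": "USFS" if i % 2 else "FHWA",
        "year": 2012 + i % 5,
    }
    for i in range(1, 26)
]

usfs = filter_results(RESULTS, agency="USFS")
page2 = paginate(usfs, page=2, per_page=10)
usfs_2012 = filter_results(RESULTS, agency="USFS", year=2012)
print(len(usfs), [r["title"] for r in page2], len(usfs_2012))
```

Filtering before paginating keeps page numbers stable for a given filter set, which is usually what a user expects when stepping through results.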

In the long run, having a broader universe of EIA documents would be great. But that might be too ambitious for this first update to the initial prototype.

She also had an interest in finding out whether Columbia U could provide resources (labor &/or hosting)– not a commitment, but promising

Frijol commented 5 years ago

Closing as this conversation has resolved as follows:

We're surfacing more than 10 results, obviating the need for pagination (https://github.com/edgi-govdata-archiving/eis-search/pull/5)
Once this is merged, we will pass the demo to the paper's author, who thinks it will be immediately useful
After waiting a month or so to see how useful it really is (direct feedback & Google Analytics), we should re-open the discussion of (a) is this useful and (b) what would it take to make it really great
