NRGI / resourcecontracts.org

Resource Contracts
http://resourcecontracts.org
GNU General Public License v2.0

[OLC only]: Do not include option to download Word contract from advanced search page #381

Closed KaitlinCCSI closed 8 years ago

KaitlinCCSI commented 8 years ago

At least for now, there should not be the option to download the Word version of a contract from the advanced search page. There are too many errors in these documents, and there is no disclaimer or explanation regarding what these are, so it raises too many credibility and risk issues. Only option on this page should be to download PDF version.

I also strongly suggest that Word versions of contracts not be downloadable from the contract view page, at least until OCR text is fixed. If a user really wants this, they can copy/paste text themselves. (Happy to hear arguments for making this type of document downloadable from contract view page, but removing from the advanced search page is absolutely necessary for sprint 9.)

[screenshot: 2015-09-24 at 9.45.25 pm]

cc @anderspeders @jedm @samccsi

jedm commented 8 years ago

Why would we offer editable versions of these contracts ever?

KaitlinCCSI commented 8 years ago

I have no idea. This was not my idea and doesn't make sense to me. I guess it then allows people to more easily cut/paste specific clauses from contracts if they're doing research, etc, but I don't personally think it's necessary, particularly since OCR text will already be available.

jedm commented 8 years ago

Yes. Suggest that everything be cut-and-pastable but only PDFs be downloadable in bulk. @byndcivilization @jcust what do you think?

byndcivilization commented 8 years ago

I think @anderspeders will jump in but I don't see any reason not to have downloadable content. That's been in the scope pretty much from the get go.

KaitlinCCSI commented 8 years ago

Downloadable Word docs were not in any scope that I was aware of. I think there are significant risks to this, since these are not true contracts, and there are so many errors in the text right now. Can we make only the PDFs downloadable in bulk until after the RC re-launch, and then discuss in our meeting in November?

jedm commented 8 years ago

@KaitlinCCSI is correct. Downloadable word docs were a high-level no-no since inception of the project in 2011. High political risk.

byndcivilization commented 8 years ago

@anderspeders ? I specifically remember a whole discussion about why this was needed at the Columbia workshop last Nov. @jimc33 remembers this also. OLC, I guess, is free to do what they want, but I think on RC we are pretty firm.

jedm commented 8 years ago

If Legal at NRGI likes it then I like it. But worth double-checking with them and with Lindsey M. at the Bank.

KaitlinCCSI commented 8 years ago

For OLC - don't make Word docs available. For RC -- agree about checking with Legal at NRGI. Lindsey, Michael, Perrine -- or at least one of them -- should also weigh in. So for now, I think the team can remove Word doc download option during this sprint, and then can confirm for next sprint whether this should be added ONLY for RC.

LindseyAM commented 8 years ago

There should definitely not be downloadable WORD docs in my opinion. PDFs fine.

jimc33 commented 8 years ago

What about txt files? The main idea behind the OCR process is to make the data available for analysis which pdf prevents

LindseyAM commented 8 years ago

I don't remember having any conversations about making the contracts editable. My understanding is OCR was to enable comparability, but not to enable users to edit the contracts.

KaitlinCCSI commented 8 years ago

Agree with Lindsey, and that was my understanding about OCR, too.

Also, hard to analyze wingdings. See screenshot of what this looks like for an older contract that hasn't gone through MT. In addition, if anyone really wants to take all of this text and put it in a Word Doc, it's easy to copy and paste from the text view on the Contract View Page. Hard to think of how this fits into our user stories, though.

[screenshot: 2015-09-25 at 1.16.54 pm]

jimc33 commented 8 years ago

I recall discussing this at some length in Columbia at the kick off meeting last year. It's a big priority from my point of view. Not about editing contracts; it's for the open data piece so contracts can be analysed

jedm commented 8 years ago

@jimc33 I can see why the OCR text should be machine-readable for those institutions or projects with the resources to do Artificial-Intelligence-style text mining or analysis, but does that mean any site user needs to be able to grab the raw text on a "retail" basis, contract-by-contract?

I would assume there are ways to generate a purpose-specific, short-term key for bulk download, or something like that. The popularity of the use case for bulk analysis will remain so much lower than the interest in PDF downloads, that it seems like the risks of making the former widely available far outweigh the challenges of making the latter doable but not one-click as the PDF downloads should be.

(All of this is of course assuming fully Turked and re-vetted text, I assume. Because otherwise the OCR wouldn't be analyzable anyway.)
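The "purpose-specific, short-term key" idea above could look something like the following. This is only a minimal sketch, assuming an HMAC-signed token that carries its own expiry time; `SECRET`, `issue_key`, `check_key` and the purpose string are hypothetical names for illustration and are not taken from the resourcecontracts.org codebase.

```python
import hashlib
import hmac
import time
from typing import Optional

# Hypothetical server-side secret; a real deployment would load this from config.
SECRET = b"replace-with-a-real-server-side-secret"


def _sign(payload: str) -> str:
    """Return a hex HMAC-SHA256 signature for the payload."""
    return hmac.new(SECRET, payload.encode(), hashlib.sha256).hexdigest()


def issue_key(purpose: str, ttl_seconds: int, now: Optional[float] = None) -> str:
    """Create a key of the form '<purpose>:<expiry>:<signature>'."""
    start = time.time() if now is None else now
    expires = int(start + ttl_seconds)
    payload = f"{purpose}:{expires}"
    return f"{payload}:{_sign(payload)}"


def check_key(key: str, purpose: str, now: Optional[float] = None) -> bool:
    """Accept the key only if it is untampered, unexpired and for this purpose."""
    try:
        key_purpose, expires_str, signature = key.rsplit(":", 2)
        expires = int(expires_str)
    except ValueError:
        return False  # malformed key
    expected = _sign(f"{key_purpose}:{expires}")
    if not hmac.compare_digest(signature.encode(), expected.encode()):
        return False  # signature mismatch
    current = time.time() if now is None else now
    return key_purpose == purpose and current < expires
```

A bulk-download endpoint could then require a valid key for a purpose such as `"bulk-text"`, while one-click PDF downloads stay open to everyone.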

byndcivilization commented 8 years ago

Just out of curiosity @KaitlinCCSI which contract is that from. Wingdings should be far from the norm and should be caught in the uploading process and sent to Mech Turk.
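One way garbled ("wingdings") OCR output could be caught at upload time is a simple character-level heuristic. The sketch below is an assumption about how such a check might work, not a description of the actual upload pipeline; `readable_ratio`, `needs_manual_review` and the 0.85 threshold are hypothetical.

```python
# Punctuation that commonly appears in contract text and should count as readable.
COMMON_PUNCTUATION = set(".,;:'\"()-/%&")


def readable_ratio(text: str) -> float:
    """Share of non-whitespace characters that are letters, digits or common punctuation."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    readable = sum(1 for c in chars if c.isalnum() or c in COMMON_PUNCTUATION)
    return readable / len(chars)


def needs_manual_review(text: str, threshold: float = 0.85) -> bool:
    """Flag text whose readable share falls below the (hypothetical) threshold."""
    return readable_ratio(text) < threshold
```

Text flagged this way could be queued for Mechanical Turk instead of being published as is. The threshold would need tuning against real contracts, and a check like this will not catch OCR output made of plausible-looking but wrong words.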

byndcivilization commented 8 years ago

also @jedm @KaitlinCCSI and @LindseyAM what exactly is the risk here? That someone will edit a contract (which remember we have the original pdf on the site) and take it somewhere and present it as fact? Because I can basically just create a dummy document and do that now.

jedm commented 8 years ago

@byndcivilization per above, advise consulting with NRGI Legal.

LindseyAM commented 8 years ago

From what I understand @byndcivilization there are 2 things. 1 - the quality of those editable files is very bad (per the screenshot from K), 2 - the word file doesn't respond to a strong use case (as far as I am aware, no one has asked for editable versions of the contracts). @jimc33 can documents be machine-readable and 'analyz-eable' without being editable and still meet your priority? The bad quality of the downloadable word files seems to be a pretty big problem to solve if the use case doesn't make it worth it. No? Keen to make our lives easier.

anderspeders commented 8 years ago

For OLC it is obviously a choice of WB and CCSI.

For RC I would argue that it is simply good open data practise to make the contracts available from the OCR / Mechanical Turk process in an open document format by default.

The open document format is for example the current policy of the UK government for documents intended for sharing and collaboration: https://www.gov.uk/government/publications/open-standards-for-government/sharing-or-collaborating-with-government-documents

So to summarize suggestion for RC: 1) Open document format download option to supplement the PDF plus 2) A proper disclaimer that explains the method by which this was extracted through OCR or Mechanical Turk and that the text might be incomplete.

NB: Finally, an open document format will be smaller in size and thus be more inclusive to communities with limited internet access.
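The second part of the suggestion, a disclaimer attached to the extracted text, could be applied at export time. The sketch below assumes a plain-text export; the disclaimer wording, `DISCLAIMER_TEMPLATE` and `build_text_export` are hypothetical and only illustrate the idea.

```python
# Hypothetical disclaimer wording; the real text would need sign-off from the project team.
DISCLAIMER_TEMPLATE = (
    'NOTE: This text was extracted from the PDF of "{name}" by {method}. '
    "It may be incomplete or contain errors. "
    "The PDF remains the reference version."
)


def build_text_export(name: str, text: str, method: str = "OCR") -> str:
    """Return the extracted text with the disclaimer as its first paragraph."""
    header = DISCLAIMER_TEMPLATE.format(name=name, method=method)
    return f"{header}\n\n{text.strip()}\n"
```

Putting the disclaimer inside the file, rather than only on the download page, means it travels with the text wherever the file ends up.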

byndcivilization commented 8 years ago

We are throwing a lot of resources at this digitisation. It seems a shame to squirrel it away because of fear. I think the poor quality of some of the docs will actually be remedied by this as it will help us to identify contracts that slipped through the cracks.

I would also actually argue that while yes some of the contract .txt quality is not perfect, there are many cases where the .txt is better (though sometimes marginally) in terms of readability than the original pdf. See:

http://alpha.openlandcontracts.org/contract/1028/view#/text vs http://alpha.openlandcontracts.org/contract/1028/view#/pdf

http://alpha.openlandcontracts.org/contract/682/view#/text vs http://alpha.openlandcontracts.org/contract/682/view#/pdf

http://alpha.openlandcontracts.org/contract/1027/view#/text vs http://alpha.openlandcontracts.org/contract/1027/view#/pdf

jimc33 commented 8 years ago

It's not so much about editing as being able to data mine the contracts - so it's the raw text we need. Putting in a PDF is essentially reversing the OCR process and makes it non machine readable. A txt file would be sufficient if Word doc is problematic. We plan to be running a hack event using the raw text and contract analysis in Mexico so that's one use case. And this is something we plan to be doing quite a bit more on in 2016 too: pulling contract data into ResourceProjects.org, linking to AMLA etc.

LindseyAM commented 8 years ago

Hi All. Thanks. I think Anders proposes a fair solution for RC.org that manages the quality concerns while preserving the spirit of 'opening' the documents.

jedm commented 8 years ago

Still eager to defer to Legal, Lindsey and Lisa's team :) on all this, but I just want to reassure my open data colleagues that I'm not suggesting blocking use cases like bulk downloads or hack events, simply that the most common use cases will be in contexts where the accurate contracts (and their meaning and relationship to other contracts and good practice) matter and their attendant politics are real.

What I do feel very strongly is that repositories like this must prioritize certain use cases (that is, the design must suggest and encourage the most desired, most common user). The biggest mistake we make on the tech side is to treat all potential uses equally and create designs that are accordingly "flat." Such "design by no-design" favors the most savvy and impedes scalable use by a wider community.

So, permit APIs, hacks and open formats, but emphasize PDFs and CSO users and low-savvy journalists, etc. As mentioned in another thread, the savvier a user is, the more clicks she or he will be willing to make to reap the rewards of all the tech side's hard work digitizing.

KaitlinCCSI commented 8 years ago

Re examples from @byndcivilization -- those have ALL been through the MT process. My understanding is that none or almost none of the RC contracts have gone through the MT process yet. Until then, I think it's fair to say that most of the OCR texts on RC would be significantly harder to read than the PDF version. (And to answer your question, the screenshot I included was from Salala, which was perhaps not a good example b/c it's such an old contract - but it's one of the only ones published on OLC that hasn't yet been through MT. I just opened up RC and pulled the first contract there - Mittal - screenshot attached of what part of it looks like. Still problematic.)

So I think one question here for RC is whether it makes sense to simply wait until after all RC contracts have gone through MT to make this option available. Regardless, I very much agree with @jedm comments about prioritizing certain use cases, and don't think anyone at CCSI wants to block hacks/open formats; just want to make sure that the site comes across as credible and reliable for all of the prioritized users in the user stories that we spent a lot of time developing.

One possible fix might be, at least for now, providing this in .txt format instead of Word documents, which might help distinguish a bit what these are and wouldn't automatically look like a contract. It doesn't fully address potential issues, but might help a bit.

I must admit that I don't quite understand, though, why these text documents have to be downloadable directly from the advanced search page - seems like anyone trying to do the type of opendata work you're describing would also know how to pull .txt/Word versions efficiently even if there isn't a download button on the advanced search page? In fact, wouldn't most people/machines be scraping this somehow rather than having someone click and download each Word doc separately?

[screenshot: 2015-09-25 at 2.34.28 pm]

anderspeders commented 8 years ago

Decision recap: For OLC: Word download option will be removed.

@anjesh Please remove word download option.

(For RC: This will be pending decision by group.)

byndcivilization commented 8 years ago

RC interim disclaimer added in pull request

anjesh commented 8 years ago

Word document option removed from OLC.