benwbrum / fromthepage

FromThePage is a wiki-like application for crowdsourcing transcription of handwritten documents.
http://fromthepage.com
GNU Affero General Public License v3.0

Access Control Lists/Private Transcription Projects #5

Closed srl295 closed 9 years ago

srl295 commented 12 years ago

It's great that FTP is being used for transcribing important documents. However, it could also be useful in situations where the content is not 'world readable'. Consider access control lists to restrict access to projects.

This could be a plugin model, where the access control is provided by other code. (In fact, that's where integrating fromthepage with other code might come in.)

benwbrum commented 12 years ago

I've already implemented transcription tool access controls for content that is not world-writable, but yours is about the third request I've had so far for purely private projects. Could you give me a specific use case to help define such a feature?

Questions that have occurred to me are:

srl295 commented 12 years ago

Use case: recent family history. Older family history might be OK for public record (as in the Diaries on your fromthepage site), but again it might not be. I'm actually considering using fromthepage for my own handwritten notes, with a very small transcription community, that might not be for public consumption for some years yet. Even older family history could potentially be sensitive.

Walling off the entire instance is a potentially workable solution, with the only downside of requiring one instance per user community.

URL hacking would be a concern: URLs could be routed through a script that verifies log-in status before serving up the bits. Of course, you would not want to do this in the 'public' model.
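A minimal sketch of that kind of gate, in Python purely for illustration (FromThePage itself is a Rails application). The names `Session`, `ScanStore`, and `serve` are hypothetical and not part of FromThePage; the sketch assumes scans are looked up by URL and that private ones are flagged.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Session:
    """Hypothetical login state; user is None for an anonymous visitor."""
    user: Optional[str] = None


@dataclass
class ScanStore:
    """Maps scan URLs to bytes and records which URLs belong to private projects."""
    files: dict = field(default_factory=dict)
    private: set = field(default_factory=set)

    def serve(self, url: str, session: Session) -> tuple:
        """Return (status, body) for a request, checking log-in for private scans."""
        if url not in self.files:
            return 404, b""
        # Public scans skip the check entirely, as in the 'public' model.
        if url in self.private and session.user is None:
            # Answer 404 rather than 403 so a guessed URL reveals nothing.
            return 404, b""
        return 200, self.files[url]
```

This only checks that someone is logged in, which is what the paragraph above describes; restricting a scan to particular users would need a membership check on top.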

Thanks, appreciate your work here.

benwbrum commented 12 years ago

While my goal in writing the FromThePage software has been to get privately-held materials out into the public, this seems like it might offer hosts (like me when I wear my fromthepage.com-webmaster hat) a way to recover some of our hosting costs. Do you think people might be willing to pay approximately $5/month for the ability to host private collections?

Regardless, would offering private projects open up the hosting provider--and (eek!) the author of the software--to liability if the private data were exposed, possibly through cracking attempts or guessed passwords? Up until now, I've written the software to a fairly low-security model, since historic documents don't require the same security as credit card numbers or such. Instituting truly high security would require things like HTTPS (to prevent packet sniffing), a better authentication engine, and the URL-guessing lock-downs on image service we discussed earlier.

I'm not sure that all private projects require this sort of thing. One request for private hosting I received about a year ago was for historic material that had been purchased by researchers who wanted to make sure their own edition based on that material was published before the documents became public. This desire not to be "scooped" seems like it could be handled very simply through the basic access controls you originally mentioned, without resorting to the more extensive measures needed for SSN/CC kinds of data.

Do you have any examples of the security policies/measures offered by genealogy sites? I wonder what Geni or FamilySearch does about this. I imagine that you'd be pretty mad if your notes were exposed via something simple like a URL hack, but how much would you worry about packet-sniffing man-in-the-middle attacks?

srl295 commented 12 years ago

Ben, I think people might support it, and it might be something to look into, though it's not really my goal at the moment. Maybe even charge for the hosting, but give ways to reward those who contribute to public projects that have some sort of grant attached to them (micro tipping or something, with a % cut going to the host).

Liability: I'm not a lawyer, but the standard concerns apply. That's always the case when there's private data. That's why my model is that I physically own/possess the hardware and manage the security thereof. The point of the security is sort of just to preserve the document intact, so that it could be made public or widely available in the future, after the contents no longer pose a security threat to those living.

As far as man-in-the-middle attacks, packet sniffing, and URL hacks go, I think that everything could be handled by having HTTP Basic (web server level) access control on the entire /fromthepage/ project (excellent idea). The authentication could use mod_auth_mysql or similar so that the access DB could be shared with other content. Finally, minor changes to fromthepage would let it accept the HTTP Basic authentication in place of its own. A pluggable authentication callback in F.T.P. would probably suffice there. An authentication callback actually seems like a bigger and bigger deal, because I have all of these related apps implemented in different languages, and I don't want my increasing set of users to need a different account for every service. OpenID is one way to handle this, as is a plugin that can take the authentication data from somewhere else.

Genealogy privacy: I would refer to, for example, http://www.phpgedview.net/privacy.php
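To make the callback idea concrete, here is a rough sketch in Python (for illustration only; FromThePage is a Rails application and has no such hook that I know of). The function names are hypothetical. `REMOTE_USER` and `HTTP_AUTHORIZATION` are the usual CGI/WSGI environment keys, and the sketch assumes the web server passes them through.

```python
import base64
from typing import Callable, Mapping, Optional, Sequence

# A callback inspects the request environment and returns a username or None.
AuthCallback = Callable[[Mapping[str, str]], Optional[str]]


def remote_user_auth(environ: Mapping[str, str]) -> Optional[str]:
    """Trust a username the web server has already verified (e.g. HTTP Basic)."""
    return environ.get("REMOTE_USER") or None


def basic_header_auth(check: Callable[[str, str], bool]) -> AuthCallback:
    """Build a callback that decodes a Basic header and asks `check` to verify it."""
    def callback(environ: Mapping[str, str]) -> Optional[str]:
        scheme, _, payload = environ.get("HTTP_AUTHORIZATION", "").partition(" ")
        if scheme.lower() != "basic":
            return None
        try:
            decoded = base64.b64decode(payload.strip(), validate=True).decode("utf-8")
        except ValueError:  # covers bad base64 and bad UTF-8
            return None
        user, sep, password = decoded.partition(":")
        return user if sep and check(user, password) else None
    return callback


def current_user(environ: Mapping[str, str],
                 callbacks: Sequence[AuthCallback]) -> Optional[str]:
    """Ask each pluggable callback in turn; the first username found wins."""
    for callback in callbacks:
        user = callback(environ)
        if user:
            return user
    return None
```

With this shape, the application would only ever call `current_user`, and a host could swap in a shared user database, OpenID, or the web server's own check by changing the list of callbacks.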

Thanks again for discussing this and for your work on fromthepage…

dlev commented 12 years ago

Ben, thank you for the invitation to comment on this discussion. I am intrigued by the potential for crowd sourcing with large genealogy and family history projects, but share the concern for privacy. Yes, I would be very willing to support hosting fees if they allowed privacy settings.

My own experience with crowd sourcing involved introducing my project to my high school English students and teaching them to do basic transcribing. Students were able to transcribe about 100 personal letters (many multi page) in two or three class periods. It was cumbersome using the school server to host images and MS word for transcribing. We spent a lot of time explaining file management. I think a similar arrangement with public help would be equally awkward.

My wish list would be for an all in one transcription tool and document hosting solution where I could build and manage my project, invite participants, and access transcriptions. I'm not so worried about the URL being hacked as I am for family members to find letters and text before the material has been screened. Some sensitive information will need to be introduced with care.

The other issue, about being "scooped," is valid as well. Family history research is filled with assumption and theory passed off as fact by well meaning relatives. I think some kind of privacy passwords or such might help keep research under wraps until it was ready to go public.

For now, I am looking at a project of a few hundred personal letters to be transcribed and indexed. I have the completed student transcriptions as well, but they have not been indexed or proofed. I look forward to finding a solution that will move my project forward.

benwbrum commented 12 years ago

I've added a baseline feature to support private collections. This will remove them from view in the application, but not prevent URL-guessing based downloads of page scans from the webserver, which is a task that requires an upgrade to Rails 3.
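The gap between those two things can be sketched as follows, in Python for illustration only. This is not the actual FromThePage code; `Collection`, its fields, and `visible_to` are hypothetical names.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Collection:
    """Hypothetical collection record with an owner and invited collaborators."""
    title: str
    owner: str
    restricted: bool = False
    collaborators: set = field(default_factory=set)


def visible_to(collections: List[Collection], user: Optional[str]) -> List[Collection]:
    """Return the collections that may be listed for `user` (None = anonymous)."""
    def allowed(c: Collection) -> bool:
        if not c.restricted:
            return True
        return user is not None and (user == c.owner or user in c.collaborators)
    return [c for c in collections if allowed(c)]
```

A filter like this only governs what the application lists and renders. Image files that the web server hands out directly never pass through it, which is why URL guessing stays possible until downloads are routed through a similar check.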

dlev commented 12 years ago

Hooray! Thank you. I can't wait to check it out.

Denise

On May 30, 2012, at 11:02 AM, "Ben W. Brumfield" wrote:

> I've added a baseline feature to support private collections. This will remove them from view in the application, but not prevent URL-guessing based downloads of page scans from the webserver, which is a task that requires an upgrade to Rails 3.

srl295 commented 12 years ago

That's a start. I will check it out. Thanks!

dlev commented 9 years ago

Hi Ben,

This just dropped into my mailbox and made me wonder what’s new with From the Page… It looks like I missed the launch of FTP2 and I do hope you are continuing work with the program. I still have a HUGE project working with my grandmother’s letters and papers and would welcome a way to move forward.

I’m checking out your website now. Please keep me on your mailing list.

Thanks, Denise

Denise May Levenick The Family Curator http://thefamilycurator.com dmlevenick@gmail.com

On Aug 18, 2015, at 8:59 AM, Ben W. Brumfield wrote:

> Closed #5 https://github.com/benwbrum/fromthepage/issues/5.