I just want to clarify: I'm not trying to tear down the project. I think it's great, and I love the approach you're taking with these microprocurements. I want you to succeed, because if you succeed we'll all benefit. I think success in this instance could come from rethinking how this data is received and published; doing that would require the datasets to be in a properly structured format.
Or perhaps this is an interim step to make the documents somewhat more accessible, i.e. fix what is in your control first, then work backwards and ask others to do better.... which is a great way to approach the problem.
Yeah, absolutely! I think it’s totally valid to try to fix the existing documents.
It’s that second criterion that prompted me to write this; I don’t think there’s an “easy to use method” for future projects that’s feasible for less than $10k (when you factor in the conversion costs for the existing docs) without access to properly structured data in an open format.
I agree there is not an "easy to use method for future projects" that gets you WCAG AA compliance. And I would go further: no matter the cost. Give me $10 million, and it still cannot be done. It is simply not possible to create an easy-to-use method to automatically infer information in unstructured documents.... the information has to be added by someone who is knowledgeable about the source material.
If we can remove the requirement that it has to be WCAG AA compliant, I think this project still has value: that is, we can make things better than what is there today.
WCAG compliance is a journey.... make it better today than what you had yesterday. Repeat.
There are lots of ways to apply automated testing to flag common machine-determinable problems. That might at best catch 25% of the problems. Still, those might be the most common accessibility problems produced in government documents.
Best to try to fix the problem at the source, though, which often comes down to the workflow used in managing and maintaining the information.
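To give a feel for what "machine-determinable" means, here is a minimal sketch in Python using BeautifulSoup. The specific checks are illustrative only; a real scanner like axe-core or pa11y covers far more ground:

```python
# A rough sketch of automated checks for common, machine-determinable
# accessibility problems in an HTML document. Illustrative only.
from bs4 import BeautifulSoup

def flag_common_issues(html_text: str) -> list[str]:
    soup = BeautifulSoup(html_text, "html.parser")
    issues = []
    # Images with no alt attribute at all (an empty alt="" is a
    # deliberate authoring choice, so it is not flagged here).
    for img in soup.find_all("img"):
        if img.get("alt") is None:
            issues.append(f"img missing alt attribute: {img.get('src')}")
    # Data tables with no header cells.
    for table in soup.find_all("table"):
        if table.find("th") is None:
            issues.append("table has no <th> header cells")
    # Document language not declared.
    root = soup.find("html")
    if root is not None and not root.get("lang"):
        issues.append("<html> element missing lang attribute")
    return issues
```

Checks like these catch the mechanical failures; judging whether alt text or table headers actually make sense still needs someone who knows the source material.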
Hi @lchski, @matthewdarwin and @mgifford,
Great to get your comments!
Good news – we do have .csv files for the two data sets in the opportunity. We didn’t originally include these links in the opportunity as we don’t need the .csv files converted to HTML5 but we should have included them. Thanks for the prompt!
Here are the links: http://epe.lac-bac.gc.ca/100/200/301/pwgsc-tpsgc/por-ef/privy_council/2017/081-16-e/datafile-winter_2017.csv
http://epe.lac-bac.gc.ca/100/200/301/pwgsc-tpsgc/por-ef/privy_council/2017/030-16-e/data.csv
We will also add these links to the opportunity page.
Re: the comments on the “easy to use method for future projects” deliverable. I'm going to check with Karen about this one and will get back to you. :-)
Can you be more clear about what you want this data converted to? Having metadata about the data always helps, but HTML5 is a pretty big collection of markup. What do you want from this? Is it a table? Is it a styled table? Is it a responsive table? Is it an accessible table? The conversion to HTML5 could be any of these (and a few others).
Interesting! So reading the feedback in the image in this comment, it looks like it’s not actually the data files that LAC needs converted, but the executive summaries and final reports.
This suggests that some of the files listed for conversion don’t need to be. I’m thinking of these specifically:
Am I correct in my understanding that this is just about the summaries and final reports? Looking at the files listed for conversion in French and comparing with those instructions from LAC, that seems to be the case. Just want to clarify.
Thanks for walking through this with us!
Hi all, we are working this through on our end. Sorry for the delayed response.
@mgifford this one is a bit hard for us to answer, as we don't have any HTML5 expertise in-house and so don't know what is possible. Our main requirement is that the tables meet accessibility requirements and be clear. The tables only need to display the data. Our feeling is that if users want to slice and dice the information further, they can do so via the .csv file.
@lchski re: your question about what needs converting... while LAC doesn't require us to submit the 5 files you mention, we provide them in response to requests from those who are not equipped to use the raw data (.csv file). Since they are made available on LAC's website, we want to ensure that these files also meet accessibility requirements.
There are lots of good resources on accessibility with HTML tables: https://www.w3.org/WAI/tutorials/tables/
Trouble is that some tables can be very complex. Mind you, if you're looking at something no more complicated than a CSV file, then that is pretty easy.
From the outside, though, it's very difficult to judge, given we don't know your data.
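To make the simple case concrete, here's a minimal sketch (Python standard library only; the file name and caption are placeholders, not the actual POR data) of turning a flat .csv like the ones linked above into a table following the basic pattern from that WAI tutorial:

```python
# Minimal sketch: flat CSV -> HTML table with a <caption> and
# <th scope="col"> headers, per the basic pattern in the W3C WAI
# tables tutorial. File name and caption are placeholders.
import csv
import html

def csv_to_accessible_table(path: str, caption: str) -> str:
    with open(path, newline="", encoding="utf-8") as f:
        rows = list(csv.reader(f))
    header, body = rows[0], rows[1:]
    parts = ["<table>", f"  <caption>{html.escape(caption)}</caption>", "  <tr>"]
    for cell in header:
        parts.append(f'    <th scope="col">{html.escape(cell)}</th>')
    parts.append("  </tr>")
    for row in body:
        parts.append("  <tr>")
        parts.extend(f"    <td>{html.escape(cell)}</td>" for cell in row)
        parts.append("  </tr>")
    parts.append("</table>")
    return "\n".join(parts)

print(csv_to_accessible_table("data.csv", "Survey results, winter 2017"))
```

That covers the easy, flat case; the multi-dimensional crosstab tables discussed below are a different beast entirely.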
As the initial awardee tasked with sorting this out, I agree all the items mentioned in this thread are valid. My solution was a balance: best effort on the main reports. Where it was obvious that the original datasets were the best answer for an archive, it was not worth heroics to produce sensible HTML from intrinsically dense data.
Full WCAG AA compliance is also an aspiration, as WCAG itself keeps moving to document things that become possible to help accessibility. I chose to favour the items that could be automated, and to keep the manual repairs to a minimum, as the spirit was to build a magic tool. At the least, I built enough to escape from the Word format and produce a simple output of the information content of the reports.
The tables in these POR studies are beyond the reach of the techniques pointed out by @mgifford, and some of them are a huge mess: tables used as a formatting tool, and used to show multiple information dimensions in a 2-D layout, which makes accessibility "impossible".
My suggestion is to push accessibility for all people back to the companies that produce these reports; perhaps then they won't look like Babylonian balance sheets. Even generating a more comprehensible version from their raw .csv files would be hard for future archive users. Again, focusing on the reports is the key to distilling the complexity.
Thus, access to the original datasets is a good thing, but it should not be part of fixing a report.
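For anyone curious what "escaping the Word format" can look like in practice, here's an illustrative sketch using pandoc driven from Python. Pandoc is a real converter, but this is a generic example, not the actual tool built for this contract, and the paths are placeholders:

```python
# Illustrative only: batch-convert .docx reports to standalone HTML5
# with pandoc (https://pandoc.org). Not the tool built for this work;
# the "reports" directory is a placeholder.
import pathlib
import subprocess

for doc in pathlib.Path("reports").glob("*.docx"):
    out = doc.with_suffix(".html")
    # --standalone emits a complete HTML5 document. Manual accessibility
    # repairs (alt text, table headers) still have to happen afterwards.
    subprocess.run(["pandoc", str(doc), "--standalone", "-o", str(out)], check=True)
```

Automation like this gets you out of the proprietary format; it does not, on its own, get you anywhere near WCAG AA.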
While I know the acceptance criteria are clear about working with the existing `.doc`/`.rtf`/`.xlsx` files, the secondary criterion of "an easy to use method that the Public Opinion Research team at PCO will use in-house for the conversion of future projects" would be much easier if access to the raw datasets were possible. If you're always going to be converting `.doc`s in the future, it's going to be a lot of pain. It'd be much, much easier if the workflow changed so that you got properly structured documents in an open format like `.csv` from the survey company. If the survey company always sent properly structured data in an open format, you wouldn't need to convert from `.doc`s. They surely have the ability to export this data in an open format; I doubt they generated those `.doc`s by hand.
Innovative procurement must be about more than just hiring outside help to fix some small issue; it should include recognizing challenges in the internal workflow and being ready to demand changes to address them. Even if the way you hire that help is innovative (and believe me, I think this is great!), scoping projects like this runs contrary to that principle.
So, to put this as a question: is it possible to change the workflow that leads to the generation of `.doc`s, receiving instead the original datasets in an open format, so that conversion work like this wouldn't be necessary in the future?