Closed titaniumbones closed 1 year ago
@titaniumbones are you talking differences in the actual systems themselves or the data we are storing and making public? If the latter, that is documented here (in the source_metadata item): https://github.com/edgi-govdata-archiving/web-monitoring#versions
There’s no pagefreezer info there yet because we do not have a consistent format to document for it yet.
Same for IA, too.
@Mr0grog I think it was the data in the actual systems themselves. But... good question! Um. This came out of the GSoC meeting, where we were trying to turn the candidates' proposals into issues which could be grouped into milestones. So I think @janakrajchadha @suchthis @mhucka @danielballan will probably refine the issue together!
@janakrajchadha Your proposal included "Understand the differences...." Once the differences are clear in your mind, documentation can be a concrete achievement for this task.
Oops, @Mr0grog's comment and the subsequent ones hadn't loaded for me when I posted the above. Yes, the task here is to both determine and document source_metadata for PF and IA.
Ah, sorry! Didn’t realize this was coming out of another discussion. 👍
@danielballan @titaniumbones Can either one of you redirect me to the place where a similar thing has been done for Versionista (if it exists)?
@janakrajchadha in terms of the raw data we can get out of Versionista, that’s never been documented.
You could check out a recent output file to get a feel, though: https://s3-us-west-2.amazonaws.com/edgi-versionista-archive/versionista1/metadata-2017-06-20T00%3A00Z.json
@titaniumbones are you talking differences in the actual systems themselves or the data we are storing and making public? If the latter, that is documented here (in the source_metadata item): https://github.com/edgi-govdata-archiving/web-monitoring#versions
@Mr0grog I may be wrong here, but what we're storing as versions contains the different fields which we want in our DB and the source_metadata is what we get from the source itself. After taking a look at the recent Versionista output file, I would say that the source_metadata for a general case of Versionista output is in fact documented well. How is the data which we are storing and making public (in the source_metadata field) different from the data in the actual system output? Am I confusing terms here?
source_metadata is what we get from the source itself
It’s close, but not exactly the same. source_metadata doesn’t include fields that are already represented in the page and version records and also flattens some fields (e.g. diff.hash → diff_hash). See here for the script that converts raw Versionista scraper output to DB input: https://github.com/edgi-govdata-archiving/web-monitoring-versionista-scraper/blob/master/bin/import-to-db#L55-L77
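For anyone who wants a feel for that conversion without reading the script, here is a rough Python sketch. It is not the actual import-to-db code: the function name, the `already_stored` parameter, and the sample record are made up for illustration, and the assumption that every nested object gets flattened one level (not just diff.hash) is mine.

```python
def to_source_metadata(raw_version, already_stored):
    """Build a flat source_metadata dict from one raw scraper record.

    raw_version: dict of raw fields, possibly with one level of nesting.
    already_stored: names of fields the page/version records already hold
        (hypothetical here; the real list is in the linked import script).
    """
    metadata = {}
    for key, value in raw_version.items():
        if key in already_stored:
            # Already represented elsewhere in the DB, so leave it out.
            continue
        if isinstance(value, dict):
            # Flatten one level: {"diff": {"hash": ...}} -> {"diff_hash": ...}
            for sub_key, sub_value in value.items():
                metadata[f"{key}_{sub_key}"] = sub_value
        else:
            metadata[key] = value
    return metadata


# Made-up sample record, not real scraper output.
raw = {
    "url": "https://example.gov/page",
    "hasContent": True,
    "diff": {"hash": "abc123", "path": "some/diff.html"},
}
flattened = to_source_metadata(raw, already_stored={"url"})
print(flattened)
```

With the sample above, `url` is dropped and the nested `diff` object becomes `diff_hash` and `diff_path`.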
@Mr0grog Oh, that was a little confusing earlier because of the term source_metadata being used for different things. Thanks for the clarification!
I was just adding the information in the documentation for some of the fields in the PageFreezer output and I was wondering if summarizing the internal info is a better way to go. Since we were talking about the data in the actual systems themselves, there are some fields which do not concern us. I just wanted to get a sense of how detailed we want this documentation to be. @danielballan @titaniumbones @suchthis @mhucka
@titaniumbones @mhucka @suchthis @danielballan @Mr0grog I've documented the data format of the different sources and I've also added a table for differences between them. A few fields don't have a description as I wasn't sure what they meant. Also, there's another IPython notebook which can be used to view an example of output for IA and PF. I've added a link to a Versionista output in the document itself. See https://github.com/edgi-govdata-archiving/web-monitoring-processing/pull/60/commits/d92164f553485b099f9b73d75cf3d0a2f067aced Please review
@danielballan Should this be closed or should we keep this open as I still have to add a little more information to the document?
Let's leave it open to track our progress. Would you enumerate the blank entries here? Then we can ask for external help.
Data:
Depth:
TaskId:
Url0:
Url1:
UrlType:
Writeflag:
diffWithPreviousDate:
diffWithFirstDate:
A few fields don't have a description as I wasn't sure what they meant…
Versionista
diffWithPreviousDate
diffWithFirstDate
These two fields are kind of weird and are sort of a result of the CSVs that analysts are currently using.
diffWithFirstDate is listed in the CSVs as the “date” of the diff between the current version and the first-ever-captured version. Diffs don’t really have a date, though, so this is actually just the capture date of the first-ever captured version of this page.
diffWithPreviousDate is the date of the current version (not the date of the previous version being diffed with, as you might expect from the name).
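To make the naming concrete, here is a small Python sketch of how the two fields relate to a page's capture history as described above. The function name, the input shape, and the sample dates are hypothetical; only the two field names come from the Versionista output.

```python
from datetime import datetime


def diff_date_fields(capture_dates, index):
    """Return the two oddly named date fields for the version at `index`.

    capture_dates: capture datetimes for one page, oldest first.
    """
    return {
        # Capture date of the first-ever captured version of the page.
        "diffWithFirstDate": capture_dates[0],
        # Capture date of the *current* version, not the previous one.
        "diffWithPreviousDate": capture_dates[index],
    }


# Made-up capture history for a single page.
dates = [datetime(2017, 6, 1), datetime(2017, 6, 10), datetime(2017, 6, 20)]
fields = diff_date_fields(dates, index=2)
```

Note that the date of the previous version (June 10 in this example) does not show up in either field.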
Three other minor notes:
hasContent indicates whether Versionista stored any content, not whether the version actually had any. One of the drawbacks of Versionista is that it won’t store content for files of a certain type or files that are too large (I think the threshold is probably somewhere around 1 or 2 MB, but Versionista has no docs on the actual number).
filePath should be where it is stored in our public archive, not on Versionista. e.g. if you have:
"filePath": "versionista1/72879-6127248/version-11822980.html"
Then you should be able to retrieve content from:
http://edgi-versionista-archive.s3.amazonaws.com/versionista1/72879-6127248/version-11822980.html
(That said, filePath is incorrect in some older metadata files, where it is the actual path on disk where we temporarily downloaded the version content before uploading to S3.)
hash is missing (I think it got accidentally combined with filePath above)
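The filePath note above amounts to prefixing the path with the archive bucket's base URL. A minimal Python sketch, assuming filePath is a bucket-relative path as in the example; the function and constant names are made up, and it does not try to handle the older metadata files where filePath is a local disk path.

```python
from urllib.parse import quote

# Base URL taken from the example above.
ARCHIVE_BASE_URL = "http://edgi-versionista-archive.s3.amazonaws.com/"


def archive_url(file_path):
    """Return the public archive URL for a version's filePath."""
    # Percent-encode unusual characters but keep the "/" separators.
    return ARCHIVE_BASE_URL + quote(file_path.lstrip("/"))


url = archive_url("versionista1/72879-6127248/version-11822980.html")
print(url)
```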
Thanks a lot @Mr0grog! The date fields are ambiguous.
hash is missing (I think it got accidentally combined with filePath above)
Yeah, I probably mixed this up as the hash and filePath are kept as a single object in the output.
Hmmm, hash and filePath should not be a single object. Are there metadata files where they are? If so, we should correct those.
The hash and path of the diff are in a single object. The version hash and filePath aren't. I think I confused those two. Apologies.
@Mr0grog I think Internet Archive and Versionista have been well documented here. There are a few fields missing in the PageFreezer part and I was hoping that we could bring up the topic of them providing us better API documentation in the coming discussions with them. cc: @ambergman
This issue has been automatically marked as stale because it has not had recent activity. It will be closed in seven days if no further activity occurs. If it should not be closed, please comment! Thank you for your contributions.
We have source_metadata_versionista documented in the DB docs; we should probably do the same for source_metadata_web_monitoring. Or we should document that info somewhere else. In any case, this still seems relevant.
At this point, I’m just going to close this. The project is shutting down.
We are building a flexible framework designed to accommodate a variety of crawled page snapshots. Different services produce different data formats. By documenting them carefully, we set ourselves up for success.