Living-with-machines / lwmdb

A django-based library for managing the Living with Machines newspapers metadata database schema
https://living-with-machines.github.io/lwmdb/
MIT License

Data to purge from the repository for rights #100

Closed griff-rees closed 1 year ago

griff-rees commented 1 year ago

This may require purging the git history, and it is worth checking with @claireaustin01.

griff-rees commented 1 year ago

Removed (but not purged) fixture-files/mitchells_db [v1].csv

kallewesterling commented 1 year ago

Upon review, I don't think we need to remove the census data. It is available open-access through the UK Data Service… I believe we are able to re-share it (I wouldn’t have added it to the repo otherwise), and upon revisiting CC BY 4.0, it states that we “are free to . . . copy and redistribute the material in any medium or format” (see here).

Looping in @claireaustin01 might be good regarding this bit, however.

griff-rees commented 1 year ago

Great thanks @kallewesterling. Perhaps the safest option would be to automatically download that link in a local deploy? Arguably that's applicable to many of these.
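A minimal sketch of what "automatically download in a local deploy" could look like, assuming each dataset is reachable at a stable URL. The function name, the `.part` suffix and the example URL are all hypothetical and not part of lwmdb:

```python
import shutil
from pathlib import Path
from urllib.request import urlopen


def fetch_if_missing(url: str, destination: Path) -> Path:
    """Download `url` to `destination` unless the file is already there."""
    if destination.exists():
        # An existing local copy is reused and never overwritten.
        return destination
    destination.parent.mkdir(parents=True, exist_ok=True)
    # Write to a temporary name first so an interrupted download never
    # leaves a half-written file that looks complete.
    partial = destination.with_name(destination.name + ".part")
    with urlopen(url) as response, partial.open("wb") as out:
        shutil.copyfileobj(response, out)
    partial.replace(destination)
    return destination


# Hypothetical usage (the URL is a placeholder, not a real dataset link):
# fetch_if_missing("https://example.org/dataset.csv", Path("census/data/dataset.csv"))
```

Because an existing file is never re-downloaded, a deploy keeps working from its local copy even if the upstream link later disappears.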

griff-rees commented 1 year ago

A potential structure for managing the workflow, where data folders include csv etc. files and fixtures the generated json for the respective models:

newspapers
├── data
└── fixtures
mitchells
├── data
└── fixtures
gazetteer
├── data
└── fixtures
census
├── data
└── fixtures
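As a rough illustration of that layout, the sketch below maps a source file in an app's data folder to the JSON fixture that would sit beside it in fixtures. The helper names and the "same stem, `.json` suffix" naming rule are assumptions for illustration, not something the repository defines:

```python
from pathlib import Path


def app_dirs(root: Path, app: str) -> dict[str, Path]:
    """Return the data and fixtures folders for one app under `root`."""
    return {"data": root / app / "data", "fixtures": root / app / "fixtures"}


def fixture_path_for(root: Path, app: str, source_file: str) -> Path:
    """Map a source file (e.g. a csv) to a JSON fixture with the same stem."""
    return app_dirs(root, app)["fixtures"] / (Path(source_file).stem + ".json")
```

For example, `fixture_path_for(Path("lwmdb"), "census", "1851_rsd_data.csv")` gives `lwmdb/census/fixtures/1851_rsd_data.json`.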
kallewesterling commented 1 year ago

Great thanks @kallewesterling. Perhaps the safest option would be to automatically download that link in a local deploy? Arguably that's applicable to many of these.

Sounds like a good idea to me. As far as I can see, it would apply to the two publicly available datasets that are used here (if we're sticking with keeping census data in there for now):

The scary thing about downloading files is obviously that the links depend on the services that provide them, long term etc. etc... You know all this, of course! :)

griff-rees commented 1 year ago

Well done, I was having a quick peek at those links and was annoyed trying to figure out the js involved, thanks for sorting that.

The scary thing about downloading files is obviously that the links depend on the services that provide them, long term etc. etc... You know all this, of course! :)

Yeah it's hard to maintain. I guess I'm thinking: maybe that addresses that concern for now, and we can return to the issue of having a final version of these included in the repository when we've had enough time to decide what's ok.

Any thoughts on this all much appreciated @claireaustin01

kallewesterling commented 1 year ago

I agree with that @griff-rees !

claireaustin01 commented 1 year ago

mcollardanuy commented 1 year ago

Hi @griff-rees, @kallewesterling, @claireaustin01,

The following files in this folder contain data from Wikidata and Geonames:

Wikidata: according to https://dumps.wikimedia.org/legal.html:

Copyrights of structured data in the main, Property, Lexeme, and EntitySchema namespaces are waived using the Creative Commons Zero (CC0) public domain dedication. All unstructured content in other namespaces is licensed under the Creative Commons Attribution-Share-Alike 3.0 License.

Geonames: according to http://download.geonames.org/export/dump/:

This work is licensed under a Creative Commons Attribution 4.0 License, see https://creativecommons.org/licenses/by/4.0/ The Data is provided "as is" without warranty or any representation of accuracy, timeliness or completeness.

So, as far as I can see, it should be fine.

griff-rees commented 1 year ago

Have backed up all the fixture files. First attempt to purge via https://rtyley.github.io/bfg-repo-cleaner/ has raised the following errors:

$ git push
Enumerating objects: 43, done.
Counting objects: 100% (40/40), done.
Delta compression using up to 4 threads
Compressing objects: 100% (15/15), done.
Writing objects: 100% (24/24), 16.97 KiB | 8.48 MiB/s, done.
Total 24 (delta 18), reused 15 (delta 9), pack-reused 0                                      
remote: Resolving deltas: 100% (18/18), completed with 9 local objects.
To github.com:Living-with-machines/lwmdb
 ! [remote rejected] refs/pull/101/head -> refs/pull/101/head (deny updating a hidden ref)    
 ! [remote rejected] refs/pull/102/head -> refs/pull/102/head (deny updating a hidden ref)
 ! [remote rejected] refs/pull/107/head -> refs/pull/107/head (deny updating a hidden ref)
 ! [remote rejected] refs/pull/107/merge -> refs/pull/107/merge (deny updating a hidden ref) 
 ! [remote rejected] refs/pull/11/head -> refs/pull/11/head (deny updating a hidden ref)
 ! [remote rejected] refs/pull/12/head -> refs/pull/12/head (deny updating a hidden ref)     
 ! [remote rejected] refs/pull/13/head -> refs/pull/13/head (deny updating a hidden ref)
 ! [remote rejected] refs/pull/15/head -> refs/pull/15/head (deny updating a hidden ref)     
 ! [remote rejected] refs/pull/18/head -> refs/pull/18/head (deny updating a hidden ref)
 ! [remote rejected] refs/pull/19/head -> refs/pull/19/head (deny updating a hidden ref)
 ! [remote rejected] refs/pull/2/head -> refs/pull/2/head (deny updating a hidden ref)      
 ! [remote rejected] refs/pull/20/head -> refs/pull/20/head (deny updating a hidden ref)
 ! [remote rejected] refs/pull/27/head -> refs/pull/27/head (deny updating a hidden ref)
 ! [remote rejected] refs/pull/28/head -> refs/pull/28/head (deny updating a hidden ref)
 ! [remote rejected] refs/pull/30/head -> refs/pull/30/head (deny updating a hidden ref)
 ! [remote rejected] refs/pull/33/head -> refs/pull/33/head (deny updating a hidden ref)
 ! [remote rejected] refs/pull/38/head -> refs/pull/38/head (deny updating a hidden ref)
 ! [remote rejected] refs/pull/39/head -> refs/pull/39/head (deny updating a hidden ref)
 ! [remote rejected] refs/pull/40/head -> refs/pull/40/head (deny updating a hidden ref)
 ! [remote rejected] refs/pull/41/head -> refs/pull/41/head (deny updating a hidden ref)
 ! [remote rejected] refs/pull/42/head -> refs/pull/42/head (deny updating a hidden ref)
 ! [remote rejected] refs/pull/43/head -> refs/pull/43/head (deny updating a hidden ref)
 ! [remote rejected] refs/pull/44/head -> refs/pull/44/head (deny updating a hidden ref)      
 ! [remote rejected] refs/pull/46/head -> refs/pull/46/head (deny updating a hidden ref)
 ! [remote rejected] refs/pull/5/head -> refs/pull/5/head (deny updating a hidden ref)
 ! [remote rejected] refs/pull/57/head -> refs/pull/57/head (deny updating a hidden ref)
 ! [remote rejected] refs/pull/58/head -> refs/pull/58/head (deny updating a hidden ref)
 ! [remote rejected] refs/pull/59/head -> refs/pull/59/head (deny updating a hidden ref)
 ! [remote rejected] refs/pull/62/head -> refs/pull/62/head (deny updating a hidden ref)
 ! [remote rejected] refs/pull/63/head -> refs/pull/63/head (deny updating a hidden ref)
 ! [remote rejected] refs/pull/67/head -> refs/pull/67/head (deny updating a hidden ref)
 ! [remote rejected] refs/pull/68/head -> refs/pull/68/head (deny updating a hidden ref)
 ! [remote rejected] refs/pull/69/head -> refs/pull/69/head (deny updating a hidden ref)
 ! [remote rejected] refs/pull/7/head -> refs/pull/7/head (deny updating a hidden ref)
 ! [remote rejected] refs/pull/72/head -> refs/pull/72/head (deny updating a hidden ref)
 ! [remote rejected] refs/pull/73/head -> refs/pull/73/head (deny updating a hidden ref)
 ! [remote rejected] refs/pull/74/head -> refs/pull/74/head (deny updating a hidden ref)
 ! [remote rejected] refs/pull/77/head -> refs/pull/77/head (deny updating a hidden ref)
 ! [remote rejected] refs/pull/78/head -> refs/pull/78/head (deny updating a hidden ref)
 ! [remote rejected] refs/pull/8/head -> refs/pull/8/head (deny updating a hidden ref)
 ! [remote rejected] refs/pull/85/head -> refs/pull/85/head (deny updating a hidden ref)
error: failed to push some refs to 'github.com:Living-with-machines/lwmdb'

kallewesterling commented 1 year ago

This looks like a good place to start troubleshooting... It looks like it might be an issue with dropping files in a repo with open pull requests :/
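Every rejected ref in the log above sits under refs/pull/, which as far as I know GitHub maintains itself for pull requests and does not let pushers update. A small sketch for pulling those refs out of saved `git push` output, to see whether anything other than pull request refs was rejected. The function names are hypothetical, and the pattern is written against the message format shown in this thread:

```python
import re

# Matches lines such as:
#  ! [remote rejected] refs/pull/101/head -> refs/pull/101/head (deny updating a hidden ref)
HIDDEN_REF = re.compile(
    r"^\s*! \[remote rejected\] (?P<ref>\S+) -> \S+ "
    r"\(deny updating a hidden ref\)\s*$"
)


def rejected_hidden_refs(push_output: str) -> list[str]:
    """Return the source refs that were rejected as hidden refs."""
    refs = []
    for line in push_output.splitlines():
        match = HIDDEN_REF.match(line)
        if match:
            refs.append(match.group("ref"))
    return refs


def only_pull_refs(refs: list[str]) -> bool:
    """True when every rejected ref lives under refs/pull/."""
    return all(ref.startswith("refs/pull/") for ref in refs)
```

If `only_pull_refs` returns True, the rejections are confined to pull request refs, which narrows the troubleshooting to how those refs ended up in the push.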

AoifeHughes commented 1 year ago

@griff-rees do you have the commands you tried with bfg, just so I don't redo exactly what you tried?

griff-rees commented 1 year ago

Thanks @AoifeHughes pretty sure this is what I found best:

$ bfg --delete-files fixture-files lwmdb.git

griff-rees commented 1 year ago

For reference: I installed bfg via:

$ sudo snap install bfg-repo-cleaner --beta

on an azure vm

AoifeHughes commented 1 year ago

Just tried it with slightly different command:

(playground) ➜  erase git clone git@github.com:Living-with-machines/lwmdb.git
Cloning into 'lwmdb'...
remote: Enumerating objects: 2319, done.
remote: Counting objects: 100% (351/351), done.
remote: Compressing objects: 100% (263/263), done.
remote: Total 2319 (delta 135), reused 167 (delta 82), pack-reused 1968
Receiving objects: 100% (2319/2319), 29.95 MiB | 4.80 MiB/s, done.
Resolving deltas: 100% (1358/1358), done.
(playground) ➜  erase cd lwmdb
(playground) ➜  lwmdb git:(main) java -jar ~/Downloads/bfg-1.14.0.jar --delete-folders fixture-files --delete-files fixture-files --private

Using repo : /Users/ahughes/erase/lwmdb/.git

Found 134 objects to protect
Found 17 commit-pointing refs : HEAD, refs/heads/main, refs/remotes/origin/HEAD, ...

Protected commits
-----------------

These are your protected commits, and so their contents will NOT be altered:

 * commit 63f18ff4 (protected by 'HEAD') - contains 17 dirty files :
    - fixture-files/JISC papers.csv (14.2 KB)
    - fixture-files/UKDA-8613-csv/1851_rsd_data.csv (1.4 MB)
    - ...

WARNING: The dirty content above may be removed from other commits, but as
the *protected* commits still use it, it will STILL exist in your repository.

Details of protected dirty content have been recorded here :

/Users/ahughes/erase/lwmdb.bfg-report/2023-06-30/11-23-15/protected-dirt/

If you *really* want this content gone, make a manual commit that removes it,
and then run the BFG on a fresh copy of your repo.

Cleaning
--------

Found 370 commits
Cleaning commits:       100% (370/370)
Cleaning commits completed in 163 ms.

Updating 13 Refs
----------------

    Ref                                              Before     After
    --------------------------------------------------------------------
    refs/heads/main                                | 63f18ff4 | a1649c52
    refs/remotes/origin/asmith-review-docs         | e8196742 | d8a0bed9
    refs/remotes/origin/fix-mitchells-import       | c9032006 | 9dc8c58b
    refs/remotes/origin/geocensus                  | dd31fd0f | 5bf21c44
    refs/remotes/origin/improve-load-json-fixtures | 513738d3 | 56e47072
    refs/remotes/origin/item-max-title-field       | 6339b3e3 | b9e2e8c9
    refs/remotes/origin/jupyterhub                 | 9e716305 | 6d7cd451
    refs/remotes/origin/kallewesterling/issue35    | c8429d77 | aec87a1c
    refs/remotes/origin/kallewesterling/issue56    | ebf57d41 | 6e04d95a
    refs/remotes/origin/main                       | 63f18ff4 | a1649c52
    refs/remotes/origin/mkdocs                     | 29b13aec | f8d69bfb
    refs/remotes/origin/production-deploy          | 738bfbab | dc84a5de
    refs/remotes/origin/thobson/issue47            | 0fed749d | 31999d4d

Updating references:    100% (13/13)
...Ref update completed in 30 ms.

Commit Tree-Dirt History
------------------------

    Earliest                                              Latest
    |                                                          |
    ......................DDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDD

    D = dirty commits (file tree fixed)
    m = modified commits (commit message or parents changed)
    . = clean commits (no changes to file tree)

                            Before     After
    -------------------------------------------
    First modified commit | ce708d9f | e16706f4
    Last dirty commit     | c9032006 | 9dc8c58b

In total, 489 object ids were changed. Full details are logged here:

    /Users/ahughes/erase/lwmdb.bfg-report/2023-06-30/11-23-15

BFG run is complete! When ready, run: git reflog expire --expire=now --all && git gc --prune=now --aggressive
(playground) ➜  lwmdb git:(main) git reflog expire --expire=now --all && git gc --prune=now --aggressive
Enumerating objects: 2299, done.
Counting objects: 100% (2299/2299), done.
Delta compression using up to 10 threads
Compressing objects: 100% (2184/2184), done.
Writing objects: 100% (2299/2299), done.
Total 2299 (delta 1400), reused 589 (delta 0), pack-reused 0

AoifeHughes commented 1 year ago

I don't have permissions to write, but does this look like what you had, @griff-rees? I used the jar file directly from the linked site.

griff-rees commented 1 year ago

Cool! I think I got that far, it was the push to main that failed

griff-rees commented 1 year ago

I need to sort your permissions. And I'm going to make another merge to main, so it'll be one more checkout, then have another go.

AoifeHughes commented 1 year ago

https://github.com/rtyley/bfg-repo-cleaner/issues/36#issuecomment-460922708 - see this comment

griff-rees commented 1 year ago

Yeah I saw that when I hit this before. Had other urgent stuff so left it

griff-rees commented 1 year ago

@AoifeHughes you've got admin rights. With great power... ;)

AoifeHughes commented 1 year ago

Okay, just for reference: I got the same errors as @griff-rees. I tried removing branch protections and also git push -f --set-upstream origin main, but couldn't get it to budge.

griff-rees commented 1 year ago

Thanks so much @AoifeHughes: really helps to reproduce that (and know I didn't miss something obvious!). There are other routes that don't use bfg... but they're hard.

griff-rees commented 1 year ago

Another option: https://github.com/newren/git-filter-repo

AoifeHughes commented 1 year ago

@griff-rees can you check whether this has been done? I think I got it working. FYI, git-filter-repo --invert-paths --path fixture-files was used for this.

griff-rees commented 1 year ago

Ah lovely! I think we need to check the history to be sure. Probably need to add to .gitignore to be safe, but I think the hardest part's done. Lovely, lovely work.
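A sketch of those two follow-up checks, assuming `git` is on the PATH. The function names and the `fixture-files/` ignore entry are illustrative choices rather than anything already in the repository:

```python
import subprocess
from pathlib import Path


def commits_touching(repo: Path, path: str) -> list[str]:
    """List commit hashes, across all local refs, that still touch `path`."""
    result = subprocess.run(
        ["git", "-C", str(repo), "log", "--all", "--format=%H", "--", path],
        check=True,
        capture_output=True,
        text=True,
    )
    return result.stdout.split()


def ensure_ignored(gitignore: Path, entry: str) -> bool:
    """Append `entry` to `gitignore` if absent; return True if the file changed."""
    text = gitignore.read_text() if gitignore.exists() else ""
    if entry in (line.strip() for line in text.splitlines()):
        return False
    # Start on a fresh line when the file does not end with a newline.
    prefix = "" if text == "" or text.endswith("\n") else "\n"
    gitignore.write_text(text + prefix + entry + "\n")
    return True
```

An empty list from `commits_touching(Path("."), "fixture-files")` on a fresh clone would suggest the folder is gone from every fetched branch. It says nothing about refs that a default clone does not fetch, so it is a sanity check rather than proof.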

AoifeHughes commented 1 year ago

closing as data is gone 😄