ArchiveBox / ArchiveBox

🗃 Open source self-hosted web archiving. Takes URLs/browser history/bookmarks/Pocket/Pinboard/etc., saves HTML, JS, PDFs, media, and more...
https://archivebox.io
MIT License
22.41k stars 1.19k forks source link

Bug: Pinboard JSON parser doesn't keep original bookmarked timestamp when adding URLs #785

Open gouku opened 3 years ago

gouku commented 3 years ago

Type

What is the problem that your feature request solves

I'd like to migrate from Pinboard to ArchiveBox. I have over 10,000 bookmarks since 2009. Export them as JSON and import them into ArchiveBox but found that all bookmarks have the same timestamps.

docker-compose run archivebox add < ~/archivebox/pinboard_export.2021.07.07_06.33.json

Describe the ideal specific solution you'd want, and whether it fits into any broader scope of changes

The import process could keep the original timestamp that's already in the JSON export:

...
{
  "href":"https:\/\/github.com\/yuliskov\/SmartTubeNext",
  "time":"2021-06-05T18:55:20Z",
  "shared":"no",
  "toread":"no",
  "tags":"youtube android"
},

What hacks or alternative solutions have you tried to solve the problem?

Manually edit 10,000 imported bookmarks? I don't think so...

How badly do you want this new feature?


gouku commented 3 years ago

Also there is another issue that breaks all tags when import bookmarks. I found a Pocket related issue https://github.com/ArchiveBox/ArchiveBox/issues/725 but also affects Pinboard. This two issues currently block me from migrating from Pinboard.

pirate commented 3 years ago

It's already supposed to be using the original timestamps, so it must be a bug in that code: https://github.com/ArchiveBox/ArchiveBox/blob/dev/archivebox/parsers/generic_json.py#L38 or something weird with your JSON export.

Can you post the first dozen lines of log output from when you add it, I want to make sure that it's actually recognizing it as JSON and not just Generic TXT.

➜ ~/D/o/A/data ⎇ (dev) # archivebox add --depth=1 'https://slatestarcodex.com/#234234' --overwrite
[i] [2021-07-07 10:11:35] ArchiveBox v0.6.3: archivebox add --depth=1 https://slatestarcodex.com/#234234 --overwrite
    > /Users/squash/Documents/opt/ArchiveBox/data

[+] [2021-07-07 10:11:36] Adding 1 links to index (crawl depth=1)...
    > Saved verbatim input to sources/1625652696-import.txt
    > Parsed 1 URLs from input (Generic TXT)                # <---------------------- this is the line I care about

[*] Starting crawl of 1 sites 1 hop out from starting point
    > Downloading https://slatestarcodex.com/#234234 contents
    > Saved verbatim input to sources/1625652696.173032-crawl-slatestarcodex.com.txt
    > Parsed 1012 URLs from input (Generic TXT)
    > Found 299 new URLs not already in index

[*] [2021-07-07 10:11:36] Writing 299 links to main index...
    √ ./index.sqlite3

[▶] [2021-07-07 10:11:37] Starting archiving of 300 snapshots in index...

[+] [2021-07-07 10:11:37] "slatestarcodex.com/#234234"
    https://slatestarcodex.com/#234234
    > ./archive/1625652696.173032
      > title
      > favicon
      > headers
gouku commented 3 years ago

I think it's recognized as JSON:

run archivebox add < ~/archivebox/pinboard_export.2021.07.07_07.35.json 
Creating archivebox_archivebox_run ... done
[i] [2021-07-07 07:43:25] ArchiveBox v0.6.2: archivebox add
    > /data

[+] [2021-07-07 07:43:28] Adding 17213 links to index (crawl depth=0)...
    > Saved verbatim input to sources/1625643808-import.txt
    > Parsed 17212 URLs from input (Generic JSON)
    > Found 16933 new URLs not already in index

[*] [2021-07-07 07:43:42] Writing 16933 links to main index...
Traceback (most recent call last):
  File "/app/archivebox/index/sql.py", line 41, in write_link_to_sql_index
    info["timestamp"] = Snapshot.objects.get(url=link.url).timestamp
  File "/usr/local/lib/python3.9/site-packages/django/db/models/manager.py", line 85, in manager_method
    return getattr(self.get_queryset(), name)(*args, **kwargs)
  File "/usr/local/lib/python3.9/site-packages/django/db/models/query.py", line 429, in get
    raise self.model.DoesNotExist(
core.models.DoesNotExist: Snapshot matching query does not exist.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.9/site-packages/django/db/models/query.py", line 589, in update_or_create
    obj = self.select_for_update().get(**kwargs)
  File "/usr/local/lib/python3.9/site-packages/django/db/models/query.py", line 429, in get
    raise self.model.DoesNotExist(
core.models.DoesNotExist: Snapshot matching query does not exist.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.9/site-packages/django/db/backends/utils.py", line 84, in _execute
    return self.cursor.execute(sql, params)
  File "/usr/local/lib/python3.9/site-packages/django/db/backends/sqlite3/base.py", line 413, in execute
    return Database.Cursor.execute(self, query, params)
sqlite3.OperationalError: database is locked

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/local/bin/archivebox", line 33, in <module>
    sys.exit(load_entry_point('archivebox', 'console_scripts', 'archivebox')())
  File "/app/archivebox/cli/__init__.py", line 140, in main
    run_subcommand(
  File "/app/archivebox/cli/__init__.py", line 80, in run_subcommand
    module.main(args=subcommand_args, stdin=stdin, pwd=pwd)    # type: ignore
  File "/app/archivebox/cli/archivebox_add.py", line 103, in main
    add(
  File "/app/archivebox/util.py", line 114, in typechecked_function
    return func(*args, **kwargs)
  File "/app/archivebox/main.py", line 602, in add
    write_main_index(links=new_links, out_dir=out_dir)
  File "/app/archivebox/util.py", line 114, in typechecked_function
    return func(*args, **kwargs)
  File "/app/archivebox/index/__init__.py", line 232, in write_main_index
    write_sql_main_index(links, out_dir=out_dir)
  File "/app/archivebox/util.py", line 114, in typechecked_function
    return func(*args, **kwargs)
  File "/app/archivebox/index/sql.py", line 88, in write_sql_main_index
    write_link_to_sql_index(link)
  File "/app/archivebox/util.py", line 114, in typechecked_function
    return func(*args, **kwargs)
  File "/app/archivebox/index/sql.py", line 46, in write_link_to_sql_index
    snapshot, _ = Snapshot.objects.update_or_create(url=link.url, defaults=info)
  File "/usr/local/lib/python3.9/site-packages/django/db/models/manager.py", line 85, in manager_method
    return getattr(self.get_queryset(), name)(*args, **kwargs)
  File "/usr/local/lib/python3.9/site-packages/django/db/models/query.py", line 594, in update_or_create
    obj, created = self._create_object_from_params(kwargs, params, lock=True)
  File "/usr/local/lib/python3.9/site-packages/django/db/models/query.py", line 610, in _create_object_from_params
    obj = self.create(**params)
  File "/usr/local/lib/python3.9/site-packages/django/db/models/query.py", line 447, in create
    obj.save(force_insert=True, using=self.db)
  File "/usr/local/lib/python3.9/site-packages/django/db/models/base.py", line 753, in save
    self.save_base(using=using, force_insert=force_insert,
  File "/usr/local/lib/python3.9/site-packages/django/db/models/base.py", line 790, in save_base
    updated = self._save_table(
  File "/usr/local/lib/python3.9/site-packages/django/db/models/base.py", line 895, in _save_table
    results = self._do_insert(cls._base_manager, using, fields, returning_fields, raw)
  File "/usr/local/lib/python3.9/site-packages/django/db/models/base.py", line 933, in _do_insert
    return manager._insert(
  File "/usr/local/lib/python3.9/site-packages/django/db/models/manager.py", line 85, in manager_method
    return getattr(self.get_queryset(), name)(*args, **kwargs)
  File "/usr/local/lib/python3.9/site-packages/django/db/models/query.py", line 1254, in _insert
    return query.get_compiler(using=using).execute_sql(returning_fields)
  File "/usr/local/lib/python3.9/site-packages/django/db/models/sql/compiler.py", line 1397, in execute_sql
    cursor.execute(sql, params)
  File "/usr/local/lib/python3.9/site-packages/django/db/backends/utils.py", line 66, in execute
    return self._execute_with_wrappers(sql, params, many=False, executor=self._execute)
  File "/usr/local/lib/python3.9/site-packages/django/db/backends/utils.py", line 75, in _execute_with_wrappers
    return executor(sql, params, many, context)
  File "/usr/local/lib/python3.9/site-packages/django/db/backends/utils.py", line 84, in _execute
    return self.cursor.execute(sql, params)
  File "/usr/local/lib/python3.9/site-packages/django/db/utils.py", line 90, in __exit__
    raise dj_exc_value.with_traceback(traceback) from exc_value
  File "/usr/local/lib/python3.9/site-packages/django/db/backends/utils.py", line 84, in _execute
    return self.cursor.execute(sql, params)
  File "/usr/local/lib/python3.9/site-packages/django/db/backends/sqlite3/base.py", line 413, in execute
    return Database.Cursor.execute(self, query, params)
django.db.utils.OperationalError: database is locked

Also I got errors when import bookmarks. Only about 200 bookmarks imported.

pirate commented 1 year ago

Replicating my comment here from the PR:

We already have a field for the original timestamp, it's Snapshot.bookmarked (aka Snapshot.timestamp, the snapshot primary key)! .added is for when Django first became aware of the URL, and .bookmarked is for any timestamp read in from an import source like bookmarks, RSS, pocket feed, etc. We need both to remain separate, but we could add an option to prefer displaying .bookmarked or .added in the UI in different places.

(So any bugfix should fix the code that sets Snapshot.timestamp, not change Snapshot.added) Sorry for the confusion, I guess we could have clearer comments in the code explaining the differences between each field.

pirate commented 9 months ago

Maybe we could use this to build a better pinboard importer: https://github.com/davep/tinboard

jimwins commented 8 months ago

The feature suggested in issue #1367 would allow sorting/display by the bookmarked date instead of added. (I think that should probably be the default, but I'll make that case over there when I've thought it through more completely.)

We should add some test cases to verify that the timestamps are being parsed and stored correctly, but I have verified via manual testing that it does work with the JSON parser including the fixes on the dev branch.

Issue #1188 (about duplicate timestamps on imports) is another issue that can cause problems with Pinboard imports because of the assumption that specific timestamps are associated with a unique URL. This should be fixed when bookmarked (aka timestamp) is no longer used as a unique identifier for snapshots, which is issue #74. (That will also lay the groundwork for having multiple snapshots of the same URL, which is issue #179.)