Open gouku opened 3 years ago
Also there is another issue that breaks all tags when import bookmarks. I found a Pocket related issue https://github.com/ArchiveBox/ArchiveBox/issues/725 but also affects Pinboard. This two issues currently block me from migrating from Pinboard.
It's already supposed to be using the original timestamps, so it must be a bug in that code: https://github.com/ArchiveBox/ArchiveBox/blob/dev/archivebox/parsers/generic_json.py#L38 or something weird with your JSON export.
Can you post the first dozen lines of log output from when you add it, I want to make sure that it's actually recognizing it as JSON
and not just Generic TXT
.
➜ ~/D/o/A/data ⎇ (dev) # archivebox add --depth=1 'https://slatestarcodex.com/#234234' --overwrite
[i] [2021-07-07 10:11:35] ArchiveBox v0.6.3: archivebox add --depth=1 https://slatestarcodex.com/#234234 --overwrite
> /Users/squash/Documents/opt/ArchiveBox/data
[+] [2021-07-07 10:11:36] Adding 1 links to index (crawl depth=1)...
> Saved verbatim input to sources/1625652696-import.txt
> Parsed 1 URLs from input (Generic TXT) # <---------------------- this is the line I care about
[*] Starting crawl of 1 sites 1 hop out from starting point
> Downloading https://slatestarcodex.com/#234234 contents
> Saved verbatim input to sources/1625652696.173032-crawl-slatestarcodex.com.txt
> Parsed 1012 URLs from input (Generic TXT)
> Found 299 new URLs not already in index
[*] [2021-07-07 10:11:36] Writing 299 links to main index...
√ ./index.sqlite3
[▶] [2021-07-07 10:11:37] Starting archiving of 300 snapshots in index...
[+] [2021-07-07 10:11:37] "slatestarcodex.com/#234234"
https://slatestarcodex.com/#234234
> ./archive/1625652696.173032
> title
> favicon
> headers
I think it's recognized as JSON:
run archivebox add < ~/archivebox/pinboard_export.2021.07.07_07.35.json
Creating archivebox_archivebox_run ... done
[i] [2021-07-07 07:43:25] ArchiveBox v0.6.2: archivebox add
> /data
[+] [2021-07-07 07:43:28] Adding 17213 links to index (crawl depth=0)...
> Saved verbatim input to sources/1625643808-import.txt
> Parsed 17212 URLs from input (Generic JSON)
> Found 16933 new URLs not already in index
[*] [2021-07-07 07:43:42] Writing 16933 links to main index...
Traceback (most recent call last):
File "/app/archivebox/index/sql.py", line 41, in write_link_to_sql_index
info["timestamp"] = Snapshot.objects.get(url=link.url).timestamp
File "/usr/local/lib/python3.9/site-packages/django/db/models/manager.py", line 85, in manager_method
return getattr(self.get_queryset(), name)(*args, **kwargs)
File "/usr/local/lib/python3.9/site-packages/django/db/models/query.py", line 429, in get
raise self.model.DoesNotExist(
core.models.DoesNotExist: Snapshot matching query does not exist.
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/local/lib/python3.9/site-packages/django/db/models/query.py", line 589, in update_or_create
obj = self.select_for_update().get(**kwargs)
File "/usr/local/lib/python3.9/site-packages/django/db/models/query.py", line 429, in get
raise self.model.DoesNotExist(
core.models.DoesNotExist: Snapshot matching query does not exist.
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/local/lib/python3.9/site-packages/django/db/backends/utils.py", line 84, in _execute
return self.cursor.execute(sql, params)
File "/usr/local/lib/python3.9/site-packages/django/db/backends/sqlite3/base.py", line 413, in execute
return Database.Cursor.execute(self, query, params)
sqlite3.OperationalError: database is locked
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/usr/local/bin/archivebox", line 33, in <module>
sys.exit(load_entry_point('archivebox', 'console_scripts', 'archivebox')())
File "/app/archivebox/cli/__init__.py", line 140, in main
run_subcommand(
File "/app/archivebox/cli/__init__.py", line 80, in run_subcommand
module.main(args=subcommand_args, stdin=stdin, pwd=pwd) # type: ignore
File "/app/archivebox/cli/archivebox_add.py", line 103, in main
add(
File "/app/archivebox/util.py", line 114, in typechecked_function
return func(*args, **kwargs)
File "/app/archivebox/main.py", line 602, in add
write_main_index(links=new_links, out_dir=out_dir)
File "/app/archivebox/util.py", line 114, in typechecked_function
return func(*args, **kwargs)
File "/app/archivebox/index/__init__.py", line 232, in write_main_index
write_sql_main_index(links, out_dir=out_dir)
File "/app/archivebox/util.py", line 114, in typechecked_function
return func(*args, **kwargs)
File "/app/archivebox/index/sql.py", line 88, in write_sql_main_index
write_link_to_sql_index(link)
File "/app/archivebox/util.py", line 114, in typechecked_function
return func(*args, **kwargs)
File "/app/archivebox/index/sql.py", line 46, in write_link_to_sql_index
snapshot, _ = Snapshot.objects.update_or_create(url=link.url, defaults=info)
File "/usr/local/lib/python3.9/site-packages/django/db/models/manager.py", line 85, in manager_method
return getattr(self.get_queryset(), name)(*args, **kwargs)
File "/usr/local/lib/python3.9/site-packages/django/db/models/query.py", line 594, in update_or_create
obj, created = self._create_object_from_params(kwargs, params, lock=True)
File "/usr/local/lib/python3.9/site-packages/django/db/models/query.py", line 610, in _create_object_from_params
obj = self.create(**params)
File "/usr/local/lib/python3.9/site-packages/django/db/models/query.py", line 447, in create
obj.save(force_insert=True, using=self.db)
File "/usr/local/lib/python3.9/site-packages/django/db/models/base.py", line 753, in save
self.save_base(using=using, force_insert=force_insert,
File "/usr/local/lib/python3.9/site-packages/django/db/models/base.py", line 790, in save_base
updated = self._save_table(
File "/usr/local/lib/python3.9/site-packages/django/db/models/base.py", line 895, in _save_table
results = self._do_insert(cls._base_manager, using, fields, returning_fields, raw)
File "/usr/local/lib/python3.9/site-packages/django/db/models/base.py", line 933, in _do_insert
return manager._insert(
File "/usr/local/lib/python3.9/site-packages/django/db/models/manager.py", line 85, in manager_method
return getattr(self.get_queryset(), name)(*args, **kwargs)
File "/usr/local/lib/python3.9/site-packages/django/db/models/query.py", line 1254, in _insert
return query.get_compiler(using=using).execute_sql(returning_fields)
File "/usr/local/lib/python3.9/site-packages/django/db/models/sql/compiler.py", line 1397, in execute_sql
cursor.execute(sql, params)
File "/usr/local/lib/python3.9/site-packages/django/db/backends/utils.py", line 66, in execute
return self._execute_with_wrappers(sql, params, many=False, executor=self._execute)
File "/usr/local/lib/python3.9/site-packages/django/db/backends/utils.py", line 75, in _execute_with_wrappers
return executor(sql, params, many, context)
File "/usr/local/lib/python3.9/site-packages/django/db/backends/utils.py", line 84, in _execute
return self.cursor.execute(sql, params)
File "/usr/local/lib/python3.9/site-packages/django/db/utils.py", line 90, in __exit__
raise dj_exc_value.with_traceback(traceback) from exc_value
File "/usr/local/lib/python3.9/site-packages/django/db/backends/utils.py", line 84, in _execute
return self.cursor.execute(sql, params)
File "/usr/local/lib/python3.9/site-packages/django/db/backends/sqlite3/base.py", line 413, in execute
return Database.Cursor.execute(self, query, params)
django.db.utils.OperationalError: database is locked
Also I got errors when import bookmarks. Only about 200 bookmarks imported.
Replicating my comment here from the PR:
We already have a field for the original timestamp, it's
Snapshot.bookmarked
(akaSnapshot.timestamp
, the snapshot primary key)!.added
is for when Django first became aware of the URL, and.bookmarked
is for any timestamp read in from an import source like bookmarks, RSS, pocket feed, etc. We need both to remain separate, but we could add an option to prefer displaying.bookmarked
or.added
in the UI in different places.
(So any bugfix should fix the code that sets Snapshot.timestamp, not change Snapshot.added) Sorry for the confusion, I guess we could have clearer comments in the code explaining the differences between each field.
Maybe we could use this to build a better pinboard importer: https://github.com/davep/tinboard
The feature suggested in issue #1367 would allow sorting/display by the bookmarked
date instead of added
. (I think that should probably be the default, but I'll make that case over there when I've thought it through more completely.)
We should add some test cases to verify that the timestamps are being parsed and stored correctly, but I have verified via manual testing that it does work with the JSON parser including the fixes on the dev branch.
Issue #1188 (about duplicate timestamps on imports) is another issue that can cause problems with Pinboard imports because of the assumption that specific timestamps are associated with a unique URL. This should be fixed when bookmarked
(aka timestamp
) is no longer used as a unique identifier for snapshots, which is issue #74. (That will also lay the groundwork for having multiple snapshots of the same URL, which is issue #179.)
Type
What is the problem that your feature request solves
I'd like to migrate from Pinboard to ArchiveBox. I have over 10,000 bookmarks since 2009. Export them as JSON and import them into ArchiveBox but found that all bookmarks have the same timestamps.
Describe the ideal specific solution you'd want, and whether it fits into any broader scope of changes
The import process could keep the original timestamp that's already in the JSON export:
What hacks or alternative solutions have you tried to solve the problem?
Manually edit 10,000 imported bookmarks? I don't think so...
How badly do you want this new feature?