Open fire opened 3 years ago
What do you mean by 3d assets? Can you provide some examples of URLs and formats that you'd expect it to be able to save?
Thanks for your prompt reply.
The typical open formats are FBX (no good open-source reader), glTF2 (open source), Blend (has an implementation in Blender, but is complicated) and USDZ (a newer standard; do not use it, due to its complexity). I think there are a few CAD formats, but they're either convertible to glTF2 or to Blender.
There are some other formats that aren't mentioned like alembic but the point is to support a well defined, narrow format that can be opened in the future for archival use.
I think the best way to go about this is to find an existing program (or snippet of puppeteer/playwright JS) that can look for these assets on a page (given some html or a url) and add it as an extractor module to archivebox.
I don't know the 3D space at all, so I'm probably not the right person for this, but I'm happy to review PRs or design proposals for such an extractor.
If I can script Blender would that be acceptable?
https://github.com/donmccurdy/glTF-Transform is web native.
@pirate Can you link me some guides for writing extractors. Also what is the format for design proposals?
I think finding the urls can be worked around by linking a direct url for now.
@fire The process for adding a new extractor is documented here:
Note the main constraint for ArchiveBox right now is deployment complexity, so I'm putting a hold on adding new binary dependencies at the moment. I don't think that will necessarily impair your ability to download 3d files, as long as they don't need any further 3d processing after download. If you have a pure Python package or npm library that can snapshot 3d assets from a URL with minimal packaging complexity and linux/macOS + x86/arm7/arm64 support, then I'm down to consider it.
The standard tool for gltf is https://gltf-transform.donmccurdy.com/ like ffmpeg in video importing.
I'll have to find a url extractor.
Maybe we can borrow code from this extension: https://github.com/stephancasas/thingiverse-stl-downloader
If anyone knows of any youtube-dl / yt-dlp equivalent program to find + download 3d assets from a URL, that would be super useful here. Please comment with any suggestions :)
At the moment I'm still not willing to write custom logic to do this extraction, as it would be too much for me to maintain as a solo developer working on ArchiveBox in my spare time, but if we can find an external program/library that can do it then the task is much easier.
Suggest an interface for me, and I might take a try at making one from scratch.
Same CLI as YouTube-dl/yt-dlp would be great. E.g.
shapefile-dl [--max-size=750m] https://example.com/some/page/containing/cad/files
It should output one or more files to the current directory the command is run in, and return 1 exit status + error text if it fails, or 2 exit status if no shape files are found.
Pure Python would be ideal, but js is also ok.
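To make the proposed interface concrete, here is a minimal sketch of the command-line shell only, assuming Python. The program name, the `--max-size` flag and the exit statuses (1 on failure, 2 when nothing is found) come from the comment above; the `download` callable, the `parse_size` helper and the base-1024 size units are hypothetical placeholders, and no actual extraction logic is implied.

```python
import argparse
import sys

# Exit statuses taken from the proposal above.
EXIT_OK, EXIT_ERROR, EXIT_NOTHING_FOUND = 0, 1, 2


def parse_size(text):
    """Convert a size such as '750m' into bytes (assumes k/m/g suffixes, base 1024)."""
    units = {"k": 1024, "m": 1024 ** 2, "g": 1024 ** 3}
    text = text.strip().lower()
    if text and text[-1] in units:
        return int(float(text[:-1]) * units[text[-1]])
    return int(text)


def build_parser():
    parser = argparse.ArgumentParser(prog="shapefile-dl")
    parser.add_argument("url", help="page to scan for 3D asset files")
    parser.add_argument("--max-size", type=parse_size, default=None,
                        help="skip files larger than this, e.g. 750m")
    return parser


def run(argv, download):
    """Parse argv and call download(url, max_size).

    `download` is a hypothetical callable that saves files into the current
    directory and returns the list of paths it wrote.
    """
    args = build_parser().parse_args(argv)
    try:
        saved = download(args.url, args.max_size)
    except Exception as exc:  # any failure maps to exit status 1 plus error text
        print(f"shapefile-dl: error: {exc}", file=sys.stderr)
        return EXIT_ERROR
    if not saved:
        return EXIT_NOTHING_FOUND
    for path in saved:
        print(path)
    return EXIT_OK
```

A real entry point would end with something like `sys.exit(run(sys.argv[1:], my_downloader))`, where `my_downloader` is whatever extraction code ends up existing. One thing to be aware of: `argparse` itself exits with status 2 on usage errors, which overlaps with the "nothing found" status proposed here.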
Oh. so it needs to be python or javascript, but not like c++ or elixir binaries, hmmm. My plan was to either write one from scratch or use Godot Engine's code I know the details for.
Godot Engine has a wasm platform.
A binary is technically ok, it's just more difficult for us to maintain and for users to install. If it's not Python or JS, then it needs to be packaged via both apt and brew, and we have to update and test more places like the Dockerfiles, CI configs, documentation, setup helper scripts, etc.
I think I can use Godot Engine to handle some of these formats in the near future.
That sounds like a lot of post-processing. I'd like to keep ArchiveBox focused on just initial preservation, not further processing of artifacts beyond that step. Post-processing steps can be done elsewhere in a pipeline by other software working on the output that ArchiveBox produces. If it requires a full 3d engine, then it's probably beyond our scope.
For now we are still looking for a suitable program that can rip 3D asset files out of an HTML page and into raw files on disk.
@benmuth would also be interested in a solution to this if you want to do some research / see what works for this problem. some sites to try extracting STLs, CAD, gltf, blend, etc. files out of:
The good thing is CAD files that don't involve animation are relatively easy, but STEP is hard.
I am trying a fork of https://github.com/V-Sekai/USD-Fileformat-plugins for conversion of 3d model formats, but it's not trivial at all. Think like 1.6 gigabytes.
One simple solution would be to run all the URLs found in a page through something like magika and download anything that has cad, 3d, shapefile, etc. in the detected type output.
It is wise to note that the process of determining dependencies might be a lot easier to solve than parsing the entire file.
Like given a fbx file it's easier to parse to find its dependent textures than to convert fbx to glb.
This is related to the only do scanning idea mentioned in the last post.
To clarify again, I don't want ArchiveBox to actually process any 3D files / read their contents, so we don't need any 3d modeling engine integration. I just want it to download whatever is available as-is. People can always have other programs read the output from archivebox.
I tried to find existing tools to extract these files, but haven't had success yet.
"One simple solution we could do is run all the URLs found in a page through something like magika and download anything that has cad, 3d, shapefile, etc. in the detected type output."
I like this idea, but it looks like magika doesn't support these formats yet. They're accepting suggestions, so maybe we can open an issue for each of these (it looks like .blend files have already been suggested).
I gave it a shot anyway with some of the file types linked in this issue, and here are the results I got (also included file results for reference):
stl
  magika: ISO 9660 CD-ROM filesystem data (archive) 99%
  file: data
gltf
  magika: JSON document (code) 97%
  file: JSON data
blend
  magika: gzip compressed data (archive) 100%
  file: gzip compressed data
STEP
  magika: Generic text document (text) [Low-confidence model best-guess: CSV document (code), score=41]
  file: ASCII text, with very long lines (1650), with CRLF line terminators
These are the first formats I found examples of, but I'd like to try more files.
Not sure how stable these results will be across all valid files of each format. The only one I'm confident would be stable is gltf, because it's literally JSON.
If we're confident that magika (or even libmagic, I guess) would give a stable, meaningful (i.e. not "data" or something) result for a given format, we could just look for links with the correct extension, check whether the linked resource produces the expected magika/libmagic output for that file type, and download it if so. Seems janky, but it might work.
Does that approach make sense? Or should we just wait for official support for each file type from magika? I can try writing a test script to see how well it works if we think it's something worth pursuing.
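As a rough illustration of the "check that the content matches the extension" step, here is a sketch that swaps magika/libmagic for a few hand-written signature checks. This is not how either tool works; it only relies on well-known markers (the `BLENDER` header of uncompressed .blend files, the `glTF` magic of binary glTF, the `asset.version` field of glTF JSON, the `ISO-10303-21;` first line of STEP files, and the fixed 84-byte header plus 50 bytes per triangle of binary STL). The function name `sniff_3d_format` and the returned labels are made up for this example, and only the gzip wrapper seen in the .blend result above is unwrapped.

```python
import gzip
import json
import struct
import zlib


def sniff_3d_format(data):
    """Best-effort guess of a 3D file format from raw bytes; returns a label or None."""
    if data[:2] == b"\x1f\x8b":  # gzip wrapper, as reported for the .blend sample
        try:
            data = gzip.decompress(data)
        except (OSError, EOFError, zlib.error):
            return None
    if data.startswith(b"BLENDER"):
        return "blend"
    if data.startswith(b"glTF"):  # binary glTF (.glb) magic
        return "glb"
    text = data.lstrip()
    if text.startswith(b"ISO-10303-21;"):
        return "step"
    if text.startswith(b"{"):
        try:
            doc = json.loads(data)
        except ValueError:  # covers invalid JSON and undecodable bytes
            doc = None
        # glTF JSON carries a top-level "asset" object with a "version" field.
        if isinstance(doc, dict) and isinstance(doc.get("asset"), dict) \
                and "version" in doc["asset"]:
            return "gltf"
    if len(data) >= 84:
        # Binary STL: 80-byte header, uint32 triangle count, 50 bytes per triangle.
        (triangles,) = struct.unpack_from("<I", data, 80)
        if len(data) == 84 + 50 * triangles:
            return "stl-binary"
    if text.startswith(b"solid"):
        return "stl-ascii"
    return None
```

A caller could compare the returned label with the link's extension and only keep the download when they agree. The checks read the whole file, so for large assets a real version would probably want to inspect only the first bytes plus the file size.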
On further inspection, magika is actually pretty disappointing; there are many formats it sucks at detecting.
I think simple extension/content-type based detection is enough for now. Running DOM/Singlefile output through a simple regex to find all URLs that end in relevant extensions (.blend, .stl, .obj, .stp, etc.) and just wget-ing those would already be super useful.
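A minimal sketch of that regex pass, assuming Python and an HTML string that has already been saved (for example the DOM or Singlefile output). The `.blend`, `.stl`, `.obj` and `.stp` extensions are from the comment above; the remaining extensions, the function name `find_asset_urls` and the restriction to `href`/`src` attributes are assumptions made for illustration.

```python
import html
import re
from urllib.parse import urljoin, urlparse

# First four come from the comment above; the rest are assumed additions.
ASSET_EXTENSIONS = (".blend", ".stl", ".obj", ".stp", ".step", ".gltf", ".glb")

# Grab the value of every quoted href/src attribute; filtering happens afterwards.
ATTR_PATTERN = re.compile(r"""(?:href|src)\s*=\s*["']([^"']+)["']""", re.IGNORECASE)


def find_asset_urls(page_html, base_url):
    """Return absolute URLs in page_html whose path ends in a known 3D extension."""
    found = []
    for raw in ATTR_PATTERN.findall(page_html):
        absolute = urljoin(base_url, html.unescape(raw.strip()))
        # Compare against the path only, so query strings like ?dl=1 don't hide the extension.
        path = urlparse(absolute).path.lower()
        if path.endswith(ASSET_EXTENSIONS) and absolute not in found:
            found.append(absolute)
    return found
```

The resulting list could be handed to wget one URL at a time. This only catches assets that are linked directly in the markup; URLs built by JavaScript at click time would not show up here.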
I've checked quite a few websites for test cases and can't find any that directly link to 3d assets. I could be looking in the wrong places though. I'd appreciate a link if someone finds one.
These are the key ones I want to support: https://github.com/ArchiveBox/ArchiveBox/issues/668#issuecomment-1958236944
If they don't link directly / download URLs can't be found with regex, then we may need to write a puppeteer script like this to get the files by clicking around the page a bit.
My thought is to have ArchiveBox save the asset both as a Blender file and as glTF2.
However, there are layers of problems here.
Any suggestions are welcome.
I can provide technical support and man-months, but not sure where to start.