ArchiveBox / ArchiveBox

🗃 Open source self-hosted web archiving. Takes URLs/browser history/bookmarks/Pocket/Pinboard/etc., saves HTML, JS, PDFs, media, and more...
https://archivebox.io
MIT License

New Extractor Idea: Find/write a "`cad-dl`" to save 3d assets, gltf files, CAD files, shapefiles, STLs, etc. #668

Open fire opened 3 years ago

fire commented 3 years ago

My thought is to have ArchiveBox save the asset both as a Blender file and as glTF2.

However, there are layers of problems here.

Any suggestions are welcome.

I can provide technical support and man-months, but not sure where to start.

pirate commented 3 years ago

What do you mean by 3d assets? Can you provide some examples of URLs and formats that you'd expect it to be able to save?

fire commented 3 years ago

Thanks for your prompt reply.

  1. (gltf without animations) https://3d.si.edu/object/3d/command-module-apollo-11:d8c63e8a-4ebc-11ea-b77f-2e728ce88125
  2. (blend) https://cloud.blender.org/p/gallery/5e46a80442fa9613e1cd1fca
  3. (blend) https://cloud.blender.org/p/gallery/60337d495677e942564cce76
  4. (gltf with culturally significant animations) https://sketchfab.com/3d-models/fortnite-floss-emote-0a52f8e8eaf7441faffd8efc8d8a9e0e
  5. (VRM based on gltf) https://booth.pm/en/items/1050142 (This is a stretch goal, but I would handle this by either ignoring the format, importing it into Blender, or identifying it as a gltf file)
  6. (VRM based on gltf) https://github.com/Miraikomachi/MiraikomachiVRM/blob/master/Miraikomachi.vrm

The typical open formats are FBX (no good open-source reader), glTF2 (open source), Blend (has an implementation in Blender, but is complicated), and USDZ (a newer standard; do not use due to complexity). I think there are a few CAD formats, but they're convertible to either glTF2 or Blender.

There are some other formats that aren't mentioned like alembic but the point is to support a well defined, narrow format that can be opened in the future for archival use.

pirate commented 3 years ago

I think the best way to go about this is to find an existing program (or snippet of puppeteer/playwright JS) that can look for these assets on a page (given some html or a url) and add it as an extractor module to archivebox.

I don't know the 3D space at all, so I'm probably not the right person for this, but I'm happy to review PRs or design proposals for such an extractor.

fire commented 3 years ago

If I can script Blender would that be acceptable?

https://github.com/donmccurdy/glTF-Transform is web native.

fire commented 1 year ago

@pirate Can you link me some guides for writing extractors? Also, what is the format for design proposals?

I think finding the urls can be worked around by linking a direct url for now.

pirate commented 1 year ago

@fire The process for adding a new extractor is documented here:

Note the main constraint for ArchiveBox right now is deployment complexity, so I'm putting a hold on adding new binary dependencies at the moment. I don't think that will necessarily impair your ability to download 3d files as long as they don't need any further 3d processing after download. If you have a pure Python package or npm library that can snapshot 3d assets from a URL with minimal packaging complexity and linux/macOS + x86/arm7/arm64 support, then I'm down to consider it.

fire commented 1 year ago

The standard tool for gltf is https://gltf-transform.donmccurdy.com/, much like ffmpeg is for video importing.

I'll have to find a url extractor.

pirate commented 1 year ago

Maybe we can borrow code from this extension: https://github.com/stephancasas/thingiverse-stl-downloader

pirate commented 1 year ago

If anyone knows of any youtube-dl / yt-dlp equivalent program to find + download 3d assets from a URL that would be super useful here. Please comment with any suggestions :)

At the moment I'm still not willing to write custom logic to do this extraction, as it would be too much for me to maintain as a solo developer working on ArchiveBox in my spare time, but if we can find an external program/library that can do it then the task is much easier.

fire commented 1 year ago

Suggest an interface for me, and I might take a try at making one from scratch.

pirate commented 1 year ago

Same CLI as YouTube-dl/yt-dlp would be great. E.g.

shapefile-dl [--max-size=750m] https://example.com/some/page/containing/cad/files

It should output one or more files to the current directory the command is run in, and return 1 exit status + error text if it fails, or 2 exit status if no shape files are found.
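A minimal sketch of that command-line contract in Python, under stated assumptions: the name `shapefile-dl`, the `--max-size` flag, and the exit codes 1 and 2 come from the comment above, while `parse_size`, `find_and_download`, and the `downloader` parameter are hypothetical names introduced only for illustration. The actual discovery and download logic is left as a placeholder.

```python
import argparse
import sys

# Exit codes as described above; the constant names are hypothetical.
EXIT_OK = 0
EXIT_ERROR = 1
EXIT_NOTHING_FOUND = 2

_UNITS = {"k": 1024, "m": 1024**2, "g": 1024**3}


def parse_size(text):
    """Convert a size such as '750m' into bytes (assumed suffixes k/m/g, base 1024)."""
    text = text.strip().lower()
    if text and text[-1] in _UNITS:
        return int(float(text[:-1]) * _UNITS[text[-1]])
    return int(text)


def find_and_download(url, max_size):
    """Placeholder: should save files into the current directory and return their paths."""
    raise NotImplementedError("asset discovery is not implemented in this sketch")


def main(argv=None, downloader=find_and_download):
    parser = argparse.ArgumentParser(prog="shapefile-dl")
    parser.add_argument("url")
    parser.add_argument("--max-size", type=parse_size, default=None)
    # Note: argparse itself exits with status 2 on usage errors, which
    # collides with the "nothing found" code proposed above.
    args = parser.parse_args(argv)
    try:
        saved = downloader(args.url, args.max_size)
    except Exception as err:
        print(f"shapefile-dl: error: {err}", file=sys.stderr)
        return EXIT_ERROR
    if not saved:
        return EXIT_NOTHING_FOUND
    for path in saved:
        print(path)
    return EXIT_OK
```

An entry point would call `sys.exit(main())`, so a wrapper such as an ArchiveBox extractor could branch on the exit status without parsing any output.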

Pure Python would be ideal, but js is also ok.

fire commented 1 year ago

Oh, so it needs to be Python or JavaScript, not something like C++ or Elixir binaries, hmmm. My plan was to either write one from scratch or reuse Godot Engine's code, which I know in detail.

Godot Engine has a wasm platform.

pirate commented 1 year ago

A binary is technically ok, it's just more difficult for us to maintain and for users to install. If it's not Python or JS, then it needs to be packaged via both apt and brew, and we have to update and test more places like the Dockerfiles, CI configs, documentation, setup helper scripts, etc.

fire commented 1 year ago

I think I can use Godot Engine to handle some of these formats in the near future.

  1. FBX - we are developing a Godot Engine opensource reader
  2. glTF2 - we can use Godot Engine to parse the metadata
  3. blender - we can use Godot Engine and Blender in a docker container
  4. USDZ - usd2glb supports converting USDZ to gltf https://github.com/fynv/usd2glb

pirate commented 1 year ago

That sounds like a lot of post-processing. I'd like to keep ArchiveBox focused on just initial preservation, not further processing of artifacts beyond that step. Post-processing steps can be done elsewhere in a pipeline by other software working on the output that ArchiveBox produces. If it requires a full 3d engine, then it's probably beyond our scope.

For now we are still looking for a suitable program that can rip 3D asset files out of an HTML page and into raw files on disk.

pirate commented 8 months ago

@benmuth would also be interested in a solution to this if you want to do some research / see what works for this problem. some sites to try extracting STLs, CAD, gltf, blend, etc. files out of:

fire commented 8 months ago

The good thing is CAD files that don't involve animation are relatively easy, but STEP is hard.

fire commented 8 months ago

I am trying a fork of https://github.com/V-Sekai/USD-Fileformat-plugins for conversion of 3d model formats, but it's not trivial at all. Think like 1.6 gigabytes.

pirate commented 8 months ago

One simple solution we could do is run all the URLs found in a page through something like magika and download anything that has cad, 3d, shapefile, etc. in the detected type output.

https://opensource.googleblog.com/2024/02/magika-ai-powered-fast-and-efficient-file-type-identification.html

fire commented 8 months ago

It is wise to note that the process of determining dependencies might be a lot easier to solve than parsing the entire file.

For example, given an FBX file, it's easier to parse it just far enough to find its dependent textures than to convert FBX to GLB.

This is related to the scan-only idea mentioned in the last post.
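As a rough illustration of dependency scanning without full parsing, here is a sketch for glTF rather than FBX, since a `.gltf` file is plain JSON and its `buffers` and `images` entries may carry a `uri` pointing at external files. The function name `external_uris` is hypothetical, and FBX would need a different, format-specific reader that is not shown.

```python
import json


def external_uris(gltf_text):
    """List the external files that a .gltf JSON document refers to."""
    doc = json.loads(gltf_text)
    uris = []
    for section in ("buffers", "images"):
        for entry in doc.get(section, []):
            uri = entry.get("uri")
            # data: URIs embed the payload inline, so there is nothing to fetch.
            if uri and not uri.startswith("data:"):
                uris.append(uri)
    return uris
```

The returned URIs are usually relative to the `.gltf` file's own URL, so a downloader would resolve them against that URL before fetching.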

pirate commented 8 months ago

To clarify again, I don't want ArchiveBox to actually process any 3D files / read their contents, so we don't need any 3d modeling engine integration. I just want it to download whatever is available as-is. People can always have other programs read the output from archivebox.

benmuth commented 7 months ago

I tried to find existing tools to extract these files, but haven't had success yet.

> One simple solution we could do is run all the URLs found in a page through something like magika and download anything that has cad, 3d, shapefile, etc. in the detected type output.
>
> https://opensource.googleblog.com/2024/02/magika-ai-powered-fast-and-efficient-file-type-identification.html

I like this idea, but it looks like magika doesn't support these formats yet. They're accepting suggestions, so maybe we can open an issue for each of these (it looks like .blend files have already been suggested).

I gave it a shot anyway with some of the file types linked in this issue, and here are the results I got (also included file results for reference):

stl
  magika: ISO 9660 CD-ROM filesystem data (archive) 99%
  file: data

gltf
  magika: JSON document (code) 97%
  file: JSON data

blend
  magika: gzip compressed data (archive) 100%
  file: gzip compressed data

STEP
  magika: Generic text document (text) [Low-confidence model best-guess: CSV document (code), score=41]
  file: ASCII text, with very long lines (1650), with CRLF line terminators

These are the first formats I found examples of, but I'd like to try more files.

Not sure how stable these results will be across all valid files of each format. The only one I'm confident would be stable is gltf because it's literally JSON.

If we're confident that magika (or even libmagic, I guess) would give a stable, meaningful (i.e. not "data" or something) result for a given format, I guess we could just look for links with the correct extension, check whether the linked resource produces the expected output for that filetype from magika/libmagic, then download it if so. Seems janky but it might work.
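The "extension plus content check" idea can be sketched without magika or libmagic by testing a few well-known file signatures by hand. This is a swapped-in technique, not what either tool does internally, and `looks_like` is a hypothetical helper. The checks rely on the GLB magic `glTF`, the required top-level `asset` property in glTF JSON, the `BLENDER` header (or a gzip/zstd wrapper around it), the `ISO-10303-21;` opener of STEP files, and the ASCII/binary layouts of STL.

```python
import json
import struct


def looks_like(extension, data):
    """Return True if `data` (the full file contents) is plausible for `extension`."""
    ext = extension.lower().lstrip(".")
    if ext == "glb":
        return data[:4] == b"glTF"
    if ext == "gltf":
        try:
            doc = json.loads(data.decode("utf-8"))
        except ValueError:  # covers both decode and JSON errors
            return False
        return isinstance(doc, dict) and "asset" in doc
    if ext == "blend":
        # Uncompressed header, or a gzip / zstd compressed file.
        return data.startswith((b"BLENDER", b"\x1f\x8b", b"\x28\xb5\x2f\xfd"))
    if ext in ("stp", "step"):
        return data.lstrip().startswith(b"ISO-10303-21;")
    if ext == "stl":
        # ASCII STL starts with "solid" and contains facets.
        if data.lstrip().startswith(b"solid") and b"facet" in data:
            return True
        # Binary STL: 80-byte header, uint32 triangle count, 50 bytes per triangle.
        if len(data) >= 84:
            (triangles,) = struct.unpack_from("<I", data, 80)
            return len(data) == 84 + 50 * triangles
        return False
    return False
```

A compressed `.blend` is only recognized as "some gzip or zstd stream" here, so this check is weaker for that format than for the others.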

Does that approach make sense? Or should we just wait for official support for each file type from magika? I can try writing a test script to see how well it works if we think it's something worth pursuing.

pirate commented 7 months ago

On further inspection magika is actually pretty disappointing, there are many formats it sucks at detecting.

I think simple extension/content-type based detection is enough for now. Running DOM/Singlefile output through a simple regex to find all URLs that end in relevant extensions (.blend, .stl, .obj, .stp, etc.) and just wget-ing those would already be super useful.
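A minimal sketch of that regex pass, assuming the DOM/Singlefile output is available as an HTML string. The function name `find_asset_urls` and the extension list beyond the four named above are assumptions, and the download step (wget or similar) is left out.

```python
import re
from urllib.parse import urljoin, urlsplit

# Hypothetical list; only .blend, .stl, .obj and .stp are named in the thread.
ASSET_EXTENSIONS = ("blend", "stl", "obj", "stp", "step", "gltf", "glb", "fbx", "usdz", "vrm")

# Value of any quoted href/src attribute; deliberately simple, not a full HTML parser.
_ATTR_RE = re.compile(r"""(?:href|src)\s*=\s*["']([^"']+)["']""", re.IGNORECASE)
_EXT_RE = re.compile(r"\.(?:%s)$" % "|".join(ASSET_EXTENSIONS), re.IGNORECASE)


def find_asset_urls(html, base_url):
    """Return absolute URLs from href/src attributes whose path ends in a known extension."""
    found = []
    for raw in _ATTR_RE.findall(html):
        absolute = urljoin(base_url, raw)
        # Match on the path only, so query strings and fragments are ignored.
        if _EXT_RE.search(urlsplit(absolute).path) and absolute not in found:
            found.append(absolute)
    return found
```

URLs that only appear inside JavaScript or JSON blobs are not picked up by this pattern; those cases would need a broader regex or a browser-driven script.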

benmuth commented 7 months ago

I've checked quite a few websites for test cases and can't find any that directly link to 3d assets. I could be looking in the wrong places though. I'd appreciate a link if someone finds one.

pirate commented 7 months ago

These are the key ones I want to support: https://github.com/ArchiveBox/ArchiveBox/issues/668#issuecomment-1958236944

If they don't link directly / download URLs can't be found with regex, then we may need to write a puppeteer script like this to get the files by clicking around the page a bit.