BenediktHeinrichs / metadataextractor

Extracting metadata from research data and providing it in an interoperable format.
https://metadataextractor.otc.coscine.dev/
MIT License

Questions about dependencies and other various remarks #13

Open NicolasCARPi opened 1 month ago

NicolasCARPi commented 1 month ago

Hello Benedikt,

DISCLAIMER: I know I'm not the best at formulating these things in an agreeable manner, so sorry if this issue seems confrontational or abrasive; that is not my intention :pray:

I was browsing the code and found things that raised questions. In this file:

https://github.com/BenediktHeinrichs/metadataextractor/blob/7996bee0ebd305b703e61189af4d61788a929adb/installDependencies.sh

Why is default-libmysqlclient-dev necessary? What is requiring it?

I'd also recommend using curl instead of wget: git already depends on libcurl, so one could argue adding curl has less impact than wget, which also pulls in more dependencies of its own.

I'm also wondering about playwright, is that a dev thing? Why is chromium needed in the docker image?

I'd highly recommend having a PROD image that contains only what's necessary for prod, and for dev building a dev image on top of it by adding what you need for tests and similar purposes.

I'd also recommend using multi-stage builds. Here is an example with poetry and a fully copiable env folder to a final image: https://github.com/manulera/ShareYourCloning_backend/blob/master/Dockerfile
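To make the suggestion concrete, here is a minimal multi-stage sketch for a Python service. The stage names, the base image, and the `pytest` dev dependency are illustrative assumptions, not taken from this repo; only `requirements.txt` and `run.sh` are mentioned in this thread.

```dockerfile
# --- build stage: install dependencies into an isolated venv ---
FROM python:3.11-slim AS builder
WORKDIR /app
RUN python -m venv /opt/venv
ENV PATH="/opt/venv/bin:$PATH"
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# --- prod stage: copy only the venv and the application code ---
FROM python:3.11-slim AS prod
COPY --from=builder /opt/venv /opt/venv
ENV PATH="/opt/venv/bin:$PATH"
WORKDIR /app
COPY . .
CMD ["./run.sh"]

# --- dev stage: built on top of prod, test tooling lives only here ---
FROM prod AS dev
RUN pip install --no-cache-dir pytest
```

With this layout, `docker build --target prod .` produces the lean deployable image, while `docker build --target dev .` gives contributors the test tooling, and build-only packages never reach the final image.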

BTW, last line is not necessary, as run.sh is already executable ;)


Regarding python packages management, I don't know how you create the requirements.txt but it might be interesting to look into poetry to manage these dependencies.
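As a rough idea of what that could look like, a hedged `pyproject.toml` fragment (the version constraints and commented dependency are placeholders, not the project's actual ones):

```toml
[tool.poetry]
name = "metadataextractor"
version = "0.1.0"
description = "Extracting metadata from research data"

[tool.poetry.dependencies]
python = "^3.11"
# direct dependencies are pinned here, e.g.:
# flask = "^3.0"
```

Poetry then maintains a lock file for reproducible installs, and (via its export plugin) `poetry export -f requirements.txt -o requirements.txt` can still generate a requirements.txt for tooling that expects one.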

Regarding logging: in Docker you must log to the container's stdout and stderr so the logs appear with the `docker logs` command and can also be forwarded to an external log centralization system. A logs folder must not exist in the container, and the debug flag must not be set for prod. Here I can see that debug.log is already being filled.
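A minimal sketch of stdout-only logging in Python (the function name and log format are illustrative, not the project's actual setup):

```python
import logging
import sys

def configure_logging(debug: bool = False) -> None:
    """Send all application logs to stdout so `docker logs` (and any
    log forwarder) can pick them up; no log files inside the container."""
    handler = logging.StreamHandler(sys.stdout)
    handler.setFormatter(
        logging.Formatter("%(asctime)s %(levelname)s %(name)s: %(message)s")
    )
    root = logging.getLogger()
    root.handlers = [handler]  # replace any file handlers
    root.setLevel(logging.DEBUG if debug else logging.INFO)

configure_logging(debug=False)
logging.getLogger("extractor").info("service started")
```

The `debug` flag would then come from the environment at container start rather than being baked into the image.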

The other major issue is that your Docker image ends up at 9.33 GB. This is way too much. It is important to keep it at a few hundred megabytes at most. 1 GB is already a lot, but here it's almost 10!

I'm also not a big fan of the request being able to modify the config. I haven't looked into it precisely, but it sounds like a very nice way for a malicious user to change a config setting such as stop_hack_attempts = True to stop_hack_attempts = False, you get the idea. Or one could modify the endpoints (e.g. PIKESAPIENDPOINT) to an attacker controlled endpoint and then the server would happily hit it and process the (malicious) response.

Allowing clients to modify the config is a serious vulnerability. Please remove this functionality. Only the user starting the container should be able to modify it, ideally with ENV vars (see https://www.12factor.net/).
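A sketch of what env-var-based config could look like (the helper name and default values are hypothetical; only `PIKESAPIENDPOINT` appears in this thread):

```python
import os

def env_bool(name: str, default: bool = False) -> bool:
    """Parse a boolean setting from the environment."""
    return os.environ.get(name, str(default)).lower() in ("1", "true", "yes")

# Config is read once at container start; requests can never alter it.
PIKESAPIENDPOINT = os.environ.get("PIKESAPIENDPOINT", "http://pikes:8011")
DEBUG = env_bool("DEBUG", False)
```

The operator then sets these with `docker run -e PIKESAPIENDPOINT=... -e DEBUG=false ...`, and no request handler ever writes to them.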


Why is dynamic_url needed? What is it for? "(loading dynamic JavaScript)" doesn't help much understanding this parameter.

What is the identifier for the file? The docs mention a unique identifier, but what is it used for, and why should the client send it in the first place? I see it in the output, so I guess it's just a convenient way to have it in the response, by echoing it from the request?


I'm following up on the releases, and hope one day to have time to really delve into its use in eLabFTW, but as I told you the first time we met, such a service must be impeccable when it comes to security. Do you think it would be feasible to ask the university to have this piece of software validated by a security auditing firm?

Sorry for the massive issue and the authoritative tone, but I'm simply trying to help you get this piece of software to an acceptable level so it can be installed in sensitive environments! :)

Best regards, ~Nicolas

BenediktHeinrichs commented 1 month ago

Hey Nicolas,

Even though I'll be dropping out of this project, I still wanted to answer some of your questions and comment on your remarks.

For the playwright and dynamic_url part: this was an experimental solution to extract metadata from websites that are heavily JavaScript-dependent. It didn't really work out that well in the end and just added more complexity than necessary, so I removed it.

Most of the things installed in installDependencies.sh are there because some build dependencies or libraries use them. This project is sadly heavily reliant on libraries and their dependencies to extract fitting metadata (Apache Tika, for example, or some AI libraries and models), which results in the large image.

The large image is something I am aware of, but it is difficult to deal with when you want every necessary dependency included on the one hand and a small image on the other. I could envision different versions of the metadata extractor, like a full version and a minimal one that only contains the low-dependency metadata extraction algorithms; maybe that could help.

For the config request, I could envision it replacing just the "Extractors" part. I am still a fan of being able to configure which extractors should actually run (so that you don't have to put every extractor into the default config and can run only some extractors on certain workloads). But this needs more thought, given your security concerns.

The idea of the identifier evolved over time from being a GUID, to the filename, to something customizable, which is why it is there as a parameter. I could see it being deprecated and removed in future updates.

Good luck in the future! ~Benedikt