Open andylemin opened 2 months ago
The complexity / fragility argument I definitely buy. It's a lot of scaffolding for this one feature. I wonder though if the metadata search can be tweaked to make it more generally useful? Do you have examples from the wild of "useless" metadata filtering?
Regarding the MySQL proposal: Do you have a POC lined up already that demonstrates how this can be set up with the existing CNID backend? It's not immediately obvious to me how the CNID backend can be plugged in as a Spotlight backend as-is, but perhaps it is straight-forward.
As a counter-argument to using MySQL for this purpose: We still arguably end up with a "difficult" solution since setting up and administering a MySQL database is non-trivial for most users. Can a simpler database be used for this, e.g. SQLite?
Finally, as yet another path forward: Build a fully indexed Spotlight backend using Elasticsearch. This is what Samba has done. And since Samba and Netatalk share DNA, it might even be fairly easy to port over their solution.
Perhaps an overload of options in one comment. My apologies. :)
This already exists for dbd afaict
search db = BOOLEAN (default: no) (V) Use fast CNID database namesearch instead of slow recursive filesystem search. Relies on a consistent CNID database, i.e. Samba or local filesystem access lead to inaccurate or wrong results. Works only for "dbd" CNID db volumes.
Hi @rdmark, Sorry for the slow reply.
I think you are underselling how easy Netatalk makes it :) Thanks to all the hard work by the Netatalk contributors, getting the MySQL backend working is simple.
For example, for a production setup on FreeBSD and ZFS pool with NVMe Special Device/Vdev;
zfs create -o atime=off -o recordsize=32k -o special_small_blocks=16k -o compression=lz4 -o xattr=sa -o dnodesize=auto -o checksum=blake3 -o redundant_metadata=most -o sync=disabled -o mountpoint=/var/db/mysql tank/cnid
If you don't have a special device vdev, remove special_small_blocks=16k
, and set recordsize=16k
instead.
sync=disabled
because it does not matter if the volume is lost/corrupted, as it takes just minutes to recreate (if you have a box with plenty of RAM and a warm ARC cache), so speed is more important.
I personally recommend not changing the default MySQL data dir, as changing the default location can be painful, instead just mount the ZFS volume over the top of /var/db/mysql as above with mountpoint=/var/db/mysql
.
pkg search mysql
pkg install mysql84-server mysql84-client
sudo service mysql-server enable
sudo service mysql-server start
sudo mysql_secure_installation
mysql -u root -p
CREATE DATABASE cnid;
Not sure if this is actually needed (it might be created automatically now) but I always do this out of habit.
Arguably the hardest part about this is the afp.conf config;
https://netatalk.io/3.2/htmldocs/configuration - says very little, and does not mention anything about how Netatalk does the hard work for you of creating the DB Schema etc.
https://netatalk.io/3.2/htmldocs/afp.conf.5 - lists the 4 Global options under Miscellaneous Options, but does not give you any hints that you also need an associated setting for each of the volumes.. (this caught me out for ages). Ie, to search for cnid scheme
in a different part of the docs. Having a complete example config (like below) in one place in the docs would help.
/usr/local/etc/afp.conf
[Global]
cnid mysql host = localhost
cnid mysql user = root
cnid mysql pw = <PASSWORD>
cnid mysql db = cnid
[share1]
cnid scheme = mysql
/usr/local/etc/rc.d/netatalk restart
Netatalk will create the required MySQL tables and schema automatically :)
dbd -f -F /usr/local/etc/afpd.conf
https://netatalk.io/3.0/htmldocs/dbd.1
With the default CNID backend, this takes several hours for a share with a few million objects. With MySQL CNID backend, this takes 10-20 minutes for me (around 10x faster in this setup).
Regarding "We still arguably end up with a "difficult" solution since setting up and administering a MySQL database is non-trivial for most users." - I don't see this, if a user can setup a Linux file server with Netatalk they could also setup MySQL as above. As long as the user is not running MySQL for other applications on the same box, it is mostly just setup and forget.. Sure there are a few steps here, but it is almost zero maintenance after installed. If you have any issues, you can just destroy and recreate the DB.
For example, the default tuning values for MySQL 8+ allow for a Netatalk share with several million files, long before you have to increase any memory limits, or do any fancy tuning. And even if you do, there are tools like mysqltuner.pl
etc.
Worst case you might have to rebuild the CNID DB with dbd
occasionally.
To be fair, you might have to fiddle with the boot scripts to ensure that the ZFS Vol is mounted before MySQL starts, and that Netatalk starts last after MySQL. And users should probably have MySQL listen on 127.0.0.1 only. Otherwise it is very little effort for a stable and significant speed boost.
Regarding user cases;
Tracker
and Elasticsearch
are intended to index file contents, not just file names. This would be useful for users hosting code and document file shares. More complex search for more complex file share.
I would guess that most users are using Netatalk for media file shares, where filename search is more useful. Simple search for simple file share.
This is what brought me to the idea of just using the MySQL backend for the simple case. It already has all the filenames, it is fast, and it is updated in realtime (as soon as a file is created (via Netatalk), it can be found in the MySQL backend).. So no changes are needed for the MySQL schema, only some Netatalk code to run an SQL query for a given Spotlight query.. I would propose this is a small change for a majority user case.
NB; For files added to the share volumes not via Netatalk, I run the dbd
command without -f
via cron regularly, to update the CNID.
Regarding a POC; Would you like me to play with some example SQL queries against the cnid DB?
Thanks for your consideration
Looks like it could be as simple as; select * from <VolUUID> where Name LIKE '%<SpotlightQueryString>%';
Ie, Substituting '*' wildcard in Spotlight query for '%' should do the trick and is case insensitive too 🙂
There is no index on the Name field (obviously), so it should be O(n) time relative to table length. A quick test showed this LIKE sequential scan can search 1M records in ~500ms consistently, and of course does not touch the spinning disks.
Keen to hear your thoughts?
I think you are underselling how easy Netatalk makes it :) Thanks to all the hard work by the Netatalk contributors, getting the MySQL backend working is simple.
Impressive research! You say "simple", and then proceed with 3 screens worth of instructions. ;)
Joke aside, I get that it's not "hard" per se, if you're familiar with operating a system like this. A good chunk of Netatalk users today are hobbyists who often don't have sysadmin experience, which is why I'm trying to streamline the installation process as much as possible.
The MySQL backend has a documentation deficit, for certain. Would you be open to filing a PR that adds the example, and any additional context that you think would help? The docs are defined in XML format here: https://github.com/Netatalk/netatalk/tree/main/doc
I do love the real-time updates and fast search of your proposed solution. It's definitely more practically useful than Tracker. What would be the next step?
Haha, I knew you would make a comment about the length of my example. 😉 It started out short..
Absolutely! I would be happy to build a documentation PR.
I can provide some narrative about the auto-setup goodness, and a complete afp.conf example. A generic on-EXT example for home Linux users, and an on-ZFS production example.
The next steps are tricker for myself, as I have not written any C for over 20 years.. I would be interested in attempting it, but I would need time from someone patient and willing to answer my newbie questions, and who could nudge me in the right direction to ease the learning curve on the code base?
I am frankly an amateur myself but am of course offering to assist to the best of my ability.
In fact, I would say that the last 3 years of this project has been characterized by a bunch of non-Cdevs (re)learning C and figuring things out as we go along. :)
Is your feature request related to a problem? Please describe.
Describe the solution you'd like
I have seen very few, if any, truly useful cases in production of searching via the metadata that Tracker provides. The vast majority of searches are just filename searches, and the metadata often just adds noise to the results. As a result you often have to configure the tracker to not scan for extra data anyway.
Being able to have a simple name only search using the existing MySQL would be fast, simple, stable and efficient.
It would also allow for instant Spotlight searches without having to wait for the index scans to catch up.