bmcgough opened 7 years ago
Messiness is subjective.
I don't think this is messy:
pwalk --NoSnap --maxthreads=32 /mnt/really_big_filesystem/bobs_folder | sed -e 's/$/,bob,bigfs/'
Yes, you are right.
I am also working with the author to eventually enable PostgreSQL binary output for direct COPY into tables.
To stick with that philosophy, a PostgreSQL binary format manipulation utility would need to be written (I have not found one). Perhaps that is what should be done instead of adding the functionality to pwalk...
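For reference, a rough sketch of what the loading side could look like; the database name (filedb) and table name (pwalk_files) are made up for illustration, and only the COPY syntax itself is standard PostgreSQL:

# today: pipe pwalk's CSV output straight into a CSV \copy
pwalk --NoSnap --maxthreads=32 /mnt/really_big_filesystem/bobs_folder | psql -d filedb -c "\copy pwalk_files from stdin with (format csv)"

# a binary output mode would feed the same command using PostgreSQL's binary COPY format,
# skipping CSV parsing/quoting entirely:
# \copy pwalk_files from stdin with (format binary)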
Hi - if you're having a conversation online about PostgreSQL loading options, I'd like to join it.
In fact, let me start one, in the wiki: Postgresql loading practices
I added our general workflow to the wiki page.
If there is interest, we should create an issue for PostgreSQL binary output. Also for any common improvements/next steps.
I'm happy to talk more about the challenges we have faced, and how we have dealt with queries, indexes, etc.
@bmcgough - 2 things...
a) You agreed with my opinion above that with a little judicious sed this feature request might be laid to rest. If so, how about closing it?
b) I would very much like to hear about some of the challenges and solutions with queries, indexes, etc. Maybe you could add them to the wiki here? I'm especially interested in whether you have made this work with organizational meta-data management mandates (aka MMMs. Not.). But seriously, how are you using this at scale and to what ends? Or did the big vendor win the showdown? And, if so, who is the BiG VeNdOr?
I agree that some sed/awk foo is an acceptable workaround for my current case.
We have been using pwalk for almost two years now. It is working here. It is messy. Our views and other things are there, though there are some missing scripts (like UID and GID table creation) that we only keep locally.
We use pwalk to crawl 1.4TB in about 500 million files. The data is gathered by pwalk into a CSV file on a scratch file system. Then a pipeline of csvquote, uconv, sort, awk, and finally psql fills the database. We have to create indexes, and in some cases materialized views, to get the query performance we need from PostgreSQL. Recently we have put this data into an Elasticsearch cluster and found querying to be much faster, but still not as fast as we need to put querying into the hands of our users.
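For reference, a minimal sketch of that pipeline, assuming a hypothetical database filedb, table pwalk_files, and scratch path; the exact uconv and csvquote options shown are illustrative guesses, not our production flags:

# 1. walk the tree and land the CSV on scratch
pwalk --NoSnap --maxthreads=32 /mnt/really_big_filesystem/bobs_folder > /scratch/walk.csv

# 2. shield embedded commas/newlines (csvquote), drop bytes that are not valid
#    UTF-8 (uconv), de-duplicate rows (sort -u), append the extra columns (awk),
#    restore the shielded characters (csvquote -u), and load (psql \copy)
csvquote /scratch/walk.csv \
  | uconv -f UTF-8 -t UTF-8 --callback skip \
  | sort -u \
  | awk -F, -v OFS=, '{print $0, "bob", "bigfs"}' \
  | csvquote -u \
  | psql -d filedb -c "\copy pwalk_files from stdin with (format csv)"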
The challenges include filenames, error handling, and run time.
During this project I have learned there is only one character not permitted in a filename: /. We even have file names with byte values that do not map to known character sets (thus the uconv). We use awk to supply the additional data for each row, and awk isn't really CSV-aware, so we need csvquote. psql is very picky (we are using COPY FROM), so sort eliminates any duplicates, but it is an additional step.
Before pwalk, I used Python with scandir to walk the tree, putting filenames onto a multiprocessing queue to be stat'ed. This meant there was a (sometimes long) window between the directory walk and the file stat, so files deleted during normal use would result in an error (file not found). With pwalk things are better, but that race condition still exists, and we hit it occasionally. So we have to differentiate between 'acceptable' errors and real ones.
It takes us a varying amount of time to run the crawl - 8-16 hours. This is likely all due to the underlying file system at this point, but it is always a challenge. Scheduling on our slurm cluster has also been a challenge as we are competing with other cluster users for resources.
We are on the cusp of launching the ES solution and using it to pull in additional metadata from our Swift cluster, S3, and our scratch file system (currently BeeGFS). But pwalk will remain the method we use to gather POSIX file metadata.
As to metadata mandates... I think getting our metadata 'ducks in a row' is step one for us. Once we have the ability to view and query all our metadata, we will finally have the tools necessary for users to begin to voluntarily manage their data and metadata. It'll be an exciting time!
This is great info - thanks @bmcgough - one question though - what does ES stand for in "the ES solution"? "Enterprise Scale"?
Elasticsearch. This is what @fizwit used to use to query pwalk data. We had a PostgreSQL project ongoing and decided to use that. It does work, but you do have to optimize queries to get the performance you want.
Ideas for the future include:
- Grab md5sum where easy (S3, Swift, etc.)
- Read file magic
- On-demand tree walking (user-triggered from a UI)
- Data moving (again, user-triggered from a UI, but data avoiding client devices)
The goal is to have pwalk output a column value the user supplies on the command line.
The example is file and folder ownership - suppose you have an owner for a folder structure that is not the same as the UID owning the folder. You would specify that value, and it would be appended to each line of output as an additional column. Yes, it would be exactly the same for every line of output, but if the output is being copied into a database directly, this would help avoid messiness in the shell around altering the output.
Perhaps a repeatable parameter to allow multiple columns?
Ex:
pwalk --NoSnap --maxthreads=32 --addcol 'bob' --addcol 'bigfs' /mnt/really_big_filesystem/bobs_folder
would produce lines like:
62943850,69436284,1,"/mnt/really_big_filesystem/bobs_folder/testfile.tst","tst",1287,1287,4096,8,"0040644",1475167363,1475167363,1475167363,4,3001,bob,bigfs
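For comparison, the sed workaround from earlier in the thread produces exactly those trailing columns today, without a new pwalk option:

pwalk --NoSnap --maxthreads=32 /mnt/really_big_filesystem/bobs_folder | sed -e 's/$/,bob,bigfs/'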