blakearchive / archive

GNU General Public License v2.0

very slow importing on macos Sierra #338

Open ba001 opened 7 years ago

ba001 commented 7 years ago

i just updated my work mac to Sierra, and now i'm having a problem importing the BADs. the process goes up by 3 or 4 percentage points and then stalls. i verified that the problem cropped up after updating to Sierra by replicating the problem on my home mac, which hadn't yet been updated and was importing just fine.

ba001 commented 7 years ago

no rush on this--just whenever you get a chance, can you look into this? i realized it's going to be quite a job for me to downgrade back to El Capitan

nathan-rice commented 7 years ago

There is no way for me to debug an issue that I can't replicate.

ba001 commented 7 years ago

Of course not. The question now is: why can't you replicate it? Because you aren't using Sierra? Or because you are, and the importing works just fine?

nathan-rice commented 7 years ago

My development environment is Linux.

queryluke commented 7 years ago

FWIW - I upgraded to Sierra today to check this out. And I can confirm the slow import. Here are some steps I tried, but to no avail.

Info files: 100%|██████████████████████████████| 71/71 [00:00<00:00, 607.75it/s]
BAD files:   0%|                                          | 0/1 [00:00<?, ?it/s]Trying ../../replice/works/america.a.xml
BAD files: 100%|██████████████████████████████████| 1/1 [01:01<00:00, 61.85s/it]
Traceback (most recent call last):
  File "import.py", line 451, in <module>
    main()
  File "import.py", line 447, in main
    importer.import_data()
  File "import.py", line 67, in import_data
    self.process_relationships()
  File "import.py", line 111, in process_relationships
    self.process_relationship(entry)
  File "import.py", line 116, in process_relationship
    obj.objects_from_same_matrix.extend(self.objects_for_id_string(entry.same_matrix_ids))
AttributeError: 'NoneType' object has no attribute 'objects_from_same_matrix'

The "Trying ../../replice/works/america.a.xml" line is just something I added to see whether a particular BAD was causing the hang-up. It's not.

You can see it took less than a second to parse all 71 info files, but about 60 seconds to process the single BAD file. Not sure how much this info helps, but that is about all I can offer.

I'll be updating the deployment scripts to backup the database next week. That way, if the import fails, we can revert the database. Once I have those scripts written, I can write some documentation on how to "sync" dev with your local machine. It's not a pretty way to develop, but it will have to suffice for now.

nathan-rice commented 7 years ago

The first thing I suggest you guys do is update your libxml2 and Python lxml libraries to the most current available versions, as that might solve the problem. Once that is done, I'll look into creating a script you can use to profile the execution of the import to see what is going on.

ba001 commented 7 years ago

Updated python LXML libraries and libxml. No luck

nathan-rice commented 7 years ago

I've updated the import script to have a --profile option. Run import.py as you normally would, but with this option added. Let it run for a while (the longer the better, but at least an hour or two), then press Control-C to terminate the import process. A file called import_stats.out should be created; post it here and I'll see what I can do about the source of the slowdown.
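For reference, a minimal sketch of how a profiling hook of this kind could work, assuming the standard-library cProfile and pstats modules. The helper names `profile_to_file` and `summarize` are hypothetical and are not taken from import.py; the sketch only shows how stats can still be written when the run is interrupted.

```python
import cProfile
import io
import pstats


def profile_to_file(func, stats_path="import_stats.out"):
    """Run func under cProfile and write the stats file, even if interrupted."""
    profiler = cProfile.Profile()
    try:
        profiler.runcall(func)
    except KeyboardInterrupt:
        # Stopping a long import by hand should still leave usable stats behind.
        pass
    finally:
        profiler.dump_stats(stats_path)
    return stats_path


def summarize(stats_path, limit=10):
    """Return a text report of the most expensive calls by cumulative time."""
    stream = io.StringIO()
    stats = pstats.Stats(stats_path, stream=stream)
    stats.sort_stats("cumulative").print_stats(limit)
    return stream.getvalue()
```

Calling `print(summarize("import_stats.out"))` on a posted stats file would list the functions with the highest cumulative time first, which is usually enough to see where an import is spending its time.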

ba001 commented 7 years ago

getting an error:

(blake) english00024:blakearchive michaelfox$ python import.py ../../data --profile
Traceback (most recent call last):
  File "import.py", line 455, in <module>
    main()
  File "import.py", line 449, in main
    cProfile.run("importer.import_data()", "import_stats.out")
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/cProfile.py", line 29, in run
    prof = prof.run(statement)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/cProfile.py", line 135, in run
    return self.runctx(cmd, dict, dict)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/cProfile.py", line 140, in runctx
    exec cmd in globals, locals
  File "<string>", line 1, in <module>
NameError: name 'importer' is not defined

something i should change?

nathan-rice commented 7 years ago

This should be operating properly now.
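The NameError reported above is consistent with how cProfile.run behaves: it executes the statement string in the `__main__` module namespace, so a variable that is local to main() is not visible to it. How the script was actually fixed is not stated in this thread; the following is only a sketch of the failure and one common remedy, with a hypothetical `Importer` class standing in for the real one.

```python
import cProfile


class Importer(object):
    """Hypothetical stand-in for the importer class in import.py."""

    def import_data(self):
        return sum(range(1000))


def main_broken(stats_path):
    importer = Importer()
    # cProfile.run() evaluates the string in the __main__ namespace,
    # where the local name `importer` does not exist, so this raises NameError.
    cProfile.run("importer.import_data()", stats_path)


def main_with_runctx(stats_path):
    importer = Importer()
    # runctx() takes explicit namespaces; passing locals() makes `importer` visible.
    cProfile.runctx("importer.import_data()", globals(), locals(), stats_path)
```

An alternative with the same effect is to create a `cProfile.Profile()` object and call its `runcall` method with `importer.import_data`, which avoids evaluating a string at all.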

ba001 commented 7 years ago

attached:

import_stats.out.zip

nathan-rice commented 7 years ago

I just pushed a version with an optimization; see how that works. If it is still too slow, post another profile log using that version.

ba001 commented 7 years ago

Little better but still very slow. I stopped it after a couple hours (I think) at 60 some percent. Here are the new stats:

import_stats.out 2.zip

nathan-rice commented 7 years ago

Your profiler is giving me very incomplete information for some reason. I've updated the import script to try to work around the issue. Please run the latest code and post another log.

ba001 commented 7 years ago

import_stats.out 3.zip

nathan-rice commented 7 years ago

Alright, the problem is that the XSLT transform is running very slowly on your system. I'm not sure why this is the case, but in any event it isn't fixable from our end.
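One way to confirm this outside the import script is to time an XSLT transform on its own. The sketch below is an assumption-laden illustration, not the archive's code: the identity stylesheet, the sample XML, and the helper names `time_call` and `time_xslt` are all made up for the example, and it is written for Python 3 (`time.perf_counter`), whereas the tracebacks in this thread come from Python 2.7. It returns None if lxml is not installed.

```python
import time

# A tiny identity stylesheet and document, included only to keep the sketch self-contained.
IDENTITY_XSLT = b"""<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:template match="@*|node()">
    <xsl:copy><xsl:apply-templates select="@*|node()"/></xsl:copy>
  </xsl:template>
</xsl:stylesheet>"""
SAMPLE_XML = b"<bad><objdesc id='a'>text</objdesc></bad>"


def time_call(func, *args, **kwargs):
    """Return (result, elapsed_seconds) for a single call."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    return result, time.perf_counter() - start


def time_xslt(xml_bytes=SAMPLE_XML, xslt_bytes=IDENTITY_XSLT):
    """Time XML parsing, stylesheet compilation, and the transform separately."""
    try:
        from lxml import etree
    except ImportError:
        return None  # lxml is not available in this environment
    doc, parse_s = time_call(etree.fromstring, xml_bytes)
    transform, compile_s = time_call(etree.XSLT, etree.fromstring(xslt_bytes))
    _, transform_s = time_call(transform, doc)
    return {"parse": parse_s, "compile": compile_s, "transform": transform_s}
```

Running `time_xslt` with a real BAD file and the project's stylesheet on both a Sierra machine and a Linux machine would show whether the gap really sits in the transform step rather than in parsing or stylesheet compilation.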

ba001 commented 7 years ago

would it help if i gave you a login to my Mac (Sierra) at my office so that you can try to debug from it? i can set up the dev environment for you under that login.

nathan-rice commented 7 years ago

Not really. I don't want to descend into debugging a large and unfamiliar external library. That's going to eat up more time than the workarounds, particularly given my lack of familiarity with the nuances of macOS internals.

My personal suggestion would be to get VirtualBox and run a Linux VM.

ba001 commented 7 years ago

ok, i can try that. in the meantime, are there other xml libraries we could try besides lxml?

nathan-rice commented 7 years ago

None that do XSLT.

As a side note, lxml's XSLT support apparently comes from libxslt, not libxml2. Perhaps you could try upgrading that library?
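Before upgrading, it may help to check which libxml2 and libxslt versions lxml is actually using, since these are not necessarily the copies installed system-wide (lxml binary packages can bundle their own). A small sketch using the version constants that lxml.etree exposes; the helper names `dotted` and `library_versions` are hypothetical, and the function returns None if lxml is not installed.

```python
def dotted(version_tuple):
    """Format a version tuple such as (1, 1, 29) as '1.1.29'."""
    return ".".join(str(part) for part in version_tuple)


def library_versions():
    """Report the lxml, libxml2, and libxslt versions visible to lxml."""
    try:
        from lxml import etree
    except ImportError:
        return None  # lxml is not available in this environment
    return {
        "lxml": dotted(etree.LXML_VERSION),
        "libxml2 (running)": dotted(etree.LIBXML_VERSION),
        "libxml2 (compiled against)": dotted(etree.LIBXML_COMPILED_VERSION),
        "libxslt (running)": dotted(etree.LIBXSLT_VERSION),
        "libxslt (compiled against)": dotted(etree.LIBXSLT_COMPILED_VERSION),
    }
```

If the "running" libxslt version reported here differs from the one the package manager shows, upgrading the system library alone would not change what the import script uses.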

ba001 commented 7 years ago

looks like i've got the latest version, 1.1.29. maybe this is something for stackexchange or an lxml (libxslt) bug report