Closed mfherbst closed 3 years ago
When I tried libxm on Thursday, I had the problem that when specifying a custom pagefile dir, this directory was always deleted (probably because all allocators are being shut down once the memory pool is re-initialised?). I don't know if you had similar issues?
Other than that, I don't have time to test this over the weekend, but I will let you know early next week if everything is fine. My nvme ssd is probably suitable for some testing, and I've got a new USB 3.1 external drive (as fast as an internal SSD) for my mac, so that would be cool to test 😬
> I don't know if you had similar issues?
Not really. Everything worked the way I expected it to work. The whole allocator setup in libtensor is very inflexible, so literally the only way you can expect `libxm` to operate sensibly is if you run exactly once per python process:

```python
import adcc
adcc.memory_pool.initialise(allocator="libxm")
```
I suppose you did that, but just to check.
> My nvme ssd is probably suitable for some testing
Yes, exactly. On those types of systems, using `libxm` or not should not make much of a difference, provided the problem is large enough, at least if the promises made in their publication hold true.
I don't like that the specified pagefile directory is deleted after the run... I vote for creating an extra tmpdir inside `pagefile_directory`, which can then be deleted. This is already the case when `/tmp` is used.
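A minimal sketch of that idea (the function names here are hypothetical, not adcc's actual API): create a uniquely named subdirectory inside the user-supplied pagefile directory and delete only that on shutdown, so the directory the user specified is never touched.

```python
import shutil
import tempfile

def make_scratch(pagefile_directory):
    """Create a private scratch subdirectory inside the user-supplied
    pagefile directory, so cleanup never removes the directory itself.
    (Hypothetical helper, not adcc's real code.)"""
    return tempfile.mkdtemp(prefix="adcc_pagefile_", dir=pagefile_directory)

def cleanup_scratch(scratch_dir):
    """Remove only the scratch subdirectory, leaving its parent intact."""
    shutil.rmtree(scratch_dir, ignore_errors=True)
```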
I will run a test job once the node with the fast SSD is free, so I can test the same job once with the RAM/std allocator and once with libxm.
I'm a bit disappointed right now: `Timer 29.7m lifetime` with RAM vs `Timer 8.5h lifetime` with libxm 😦 for the same job.
I'm worried I made some mistake with the hardware, but I don't think so...
First thing to check: Which version of libtensor did you use? Did you use the most recent one (freshly compiled and installed) or could it be you might have still had an old one on disk, which got used instead?
Also could well be that the job was too small or you used too many threads. The libxm design only pays off if the time needed to do the contractions is roughly the same as the time needed to fetch the data from disk.
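A back-of-the-envelope model of that trade-off (all throughput numbers below are illustrative assumptions, not benchmarks of libtensor or libxm):

```python
# Rough break-even model for an out-of-core scheme like libxm, which
# overlaps disk prefetch with tensor contractions.  The rates here are
# made-up, plausible orders of magnitude, not measured values.

def contraction_seconds(n_flop, n_threads, flop_per_thread=5.0e10):
    """Time to contract one tensor block on the CPU."""
    return n_flop / (n_threads * flop_per_thread)

def fetch_seconds(n_bytes, bandwidth=2.0e9):
    """Time to prefetch the same block from disk (assumed ~2 GB/s NVMe)."""
    return n_bytes / bandwidth

def disk_bound(n_flop, n_bytes, n_threads):
    """True if the disk cannot keep up, i.e. the contraction finishes
    before the next block has been prefetched."""
    return contraction_seconds(n_flop, n_threads) < fetch_seconds(n_bytes)
```

With few threads the contraction hides the prefetch; adding threads shrinks the compute time until the job flips into the disk-bound regime, where extra threads (or a faster CPU) no longer help.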
> First thing to check: Which version of libtensor did you use? Did you use the most recent one (freshly compiled and installed) or could it be you might have still had an old one on disk, which got used instead?
Yeah, I used a freshly built libtensor+xm
> Also could well be that the job was too small or you used too many threads. The libxm design only pays off if the time needed to do the contractions is roughly the same as the time needed to fetch the data from disk.
You're right: with fewer threads, the runtime for the `xm` job stays nearly the same as before (7.7h).
I can try with a larger basis set and see what happens.
> I can try with a larger basis set and see what happens.
Yes that would probably be good. Let me know what results you get.
Needs a `bibtex_bibfiles = ['ref.bib', 'pub.bib']` in `docs/conf.py` to build the docs with the newest sphinx version... I've added this in #119.
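For context, the relevant fragment of `docs/conf.py` would look roughly like this (the extension list is assumed from the project's use of bibtex; sphinxcontrib-bibtex made `bibtex_bibfiles` mandatory in version 2.0):

```python
# docs/conf.py (fragment) -- a sketch; the surrounding Sphinx
# configuration is omitted and the extension list is an assumption.
extensions = [
    "sphinxcontrib.bibtex",  # the extension that consumes bibtex_bibfiles
]

# Required since sphinxcontrib-bibtex 2.0; newer Sphinx setups fail without it
bibtex_bibfiles = ['ref.bib', 'pub.bib']
```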
Yes I know :+1:. I had the change locally, but did not push, because I saw you fixed this elsewhere.
I think we should pin `libtensorlight >= 3.0.0` for conda builds because the changes here won't be backward-compatible.
I'll not bother adding docs for this ... it's not super useful anyway, so I'd rather not raise hopes too high.
Agreed. Feel free to merge it and bump to 0.15.8 once the CI is done.
Any clue why the macOS tests fail?
The failure while downloading the json file with HTTP Error 403 is more or less random; it also happens all the time in my other PR.
Hmm ... that's a bit annoying.
The problem is that I cannot reproduce it locally on macOS, so it's hard to find a fix ☹️ Maybe bump the number of retries to 10?
The biggest problem that I can see with this issue is the deployment pipeline for macOS...
Let's see with 10 tries. If that does not help, I'm not sure what we can do.
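For reference, such a retry loop could be sketched like this (the function name and the backoff scheme are invented for illustration; the actual download code in the build scripts may look different):

```python
import time
import urllib.error
import urllib.request

def download_with_retries(url, dest, n_tries=10, delay=2.0):
    """Fetch url to dest, retrying on transient failures such as the
    spurious HTTP 403 responses seen on the macOS CI runners.
    (Hypothetical helper sketched from the discussion above.)"""
    for attempt in range(1, n_tries + 1):
        try:
            urllib.request.urlretrieve(url, dest)
            return dest
        except urllib.error.URLError:  # HTTPError is a subclass
            if attempt == n_tries:
                raise
            time.sleep(delay * attempt)  # simple linear backoff
```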
I think one reason is that the GitHub API request is made without an access token, and it could be that GitHub blocks our requests because all the macOS VMs have the same IP 😦 (result from a random web search). Maybe we could implement a fallback option to directly download a specific version release without the API call?
Well ... that would be quite messy. Maybe a better idea is to introduce an environment variable with the url, that we would just set in the case of GHA?
> Well ... that would be quite messy. Maybe a better idea is to introduce an environment variable with the url, that we would just set in the case of GHA?
Like `LT_DOWNLOAD_LINK` or something of the sort?
`ADCC_LIBTENSOR_DOWNLOAD_URL`, I'd say. It should be explicit ... because it's not as if people will use that on a regular basis.
Yes, perfect.
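The lookup could work roughly like this (the function name and the fallback argument are invented for illustration; only the variable name `ADCC_LIBTENSOR_DOWNLOAD_URL` comes from the discussion above):

```python
import os

def libtensor_download_url(api_url):
    """Prefer an explicitly configured download URL, e.g. set in the
    GitHub Actions workflow, over the GitHub API lookup that can get
    rate-limited when many CI VMs share one IP address.
    (Hypothetical sketch, not the actual adcc setup code.)"""
    return os.environ.get("ADCC_LIBTENSOR_DOWNLOAD_URL", api_url)
```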
Thanks for making the change! If it works, we need to add the same to `publish.yml`, I think.
I don't think so because it uses conda to get libtensor, no?
> I don't think so because it uses conda to get libtensor, no?
The `conda` deploy pipeline yes, but not the `pip` one.
but that uses ubuntu
> but that uses ubuntu
Oh yes, of course.
Damn, now `coveralls` upload fails on all pipelines 🤯
coveralls had some security issues a while back; perhaps we need to update to a newer version of the GHA?
We currently do a manual install of `coveralls` + upload, pulling the most recent version from `pip` ... Maybe we need to switch to the GHA for it to work?
@maxscheurer see lemurheavy/coveralls-public#1543 ... we can only wait.
Drat... want to merge anyways?
Yes ok, let's do it.
Needs the changes from https://github.com/adc-connect/libtensor/pull/9.
Progress
@maxscheurer Feel free to play with it with a nice example. Libxm seems to write everything to disk and prefetch as it goes. That means that the memory requirement is super small, but also that it's only worth it if you have a fast disk and a calculation that no longer fits into your available RAM. But in that sense it closes a gap that adcc still has.