edgi-govdata-archiving / wayback

A Python API to the Internet Archive Wayback Machine
https://wayback.readthedocs.io/en/stable/
BSD 3-Clause "New" or "Revised" License
67 stars 12 forks source link

Use hatch, hatch-vcs; drop setuptools, versioneer. #145

Closed danielballan closed 11 months ago

danielballan commented 12 months ago

This is part of the way there. Some warnings needs to be resolved. I haven't seen these before.

$ python -m build .
* Creating virtualenv isolated environment...
* Installing packages in isolated environment... (hatch-vcs, hatchling)
* Getting build dependencies for sdist...
* Building sdist...
* Building wheel from sdist
* Creating virtualenv isolated environment...
* Installing packages in isolated environment... (hatch-vcs, hatchling)
* Getting build dependencies for wheel...
* Building wheel...
/tmp/build-env-zr82pp42/lib/python3.11/site-packages/setuptools_scm/git.py:308: UserWarning: git archive did not support describe output
  warnings.warn("git archive did not support describe output")
/tmp/build-env-zr82pp42/lib/python3.11/site-packages/setuptools_scm/git.py:327: UserWarning: unprocessed git archival found (no export subst applied)
  warnings.warn("unprocessed git archival found (no export subst applied)")
danielballan commented 12 months ago

Ref: https://learn.scientific-python.org/development/guides/packaging-simple/#versioning

danielballan commented 12 months ago

Both warnings are coming from setuptools_scm which should not be involved at all here.

$ git grep setuptools
$ git grep scm
wayback/tests/cassettes/test_get_memento_with_mode.yaml:        7w5WeXtQfkKRkGAp0pGfR1RcdwJ/Icvl2yF224gXjAIxvhHbqp1Jscm4q3p14g1BsPXNsf5ULihF
wayback/tests/cassettes/test_get_memento_with_redirect_in_view_mode.yaml:        sveDbKr6kPX2ckJ3W5SDscmkXJxys6ze1fbXSulgQJfqptwDsjuv8fX3JDbOZ+dBDk99KXZJwqxG
wayback/tests/cassettes/test_get_memento_with_redirect_in_view_mode.yaml:        UgTAFn1wQkiYh3FhdMbW6UFdbsFxaFJscmV5MTeT+cTJdSW5KCVN10boqBiDcX4r5+yqJQqyADkw
Mr0grog commented 12 months ago

I’m new to build, but is it correct that it’s making a second isolated environment and building a wheel even though it already built an sdist and a wheel?

It looks like the warnings are coming from the second build, and seems strange (without knowing any better) for that second build to even be happening. I noticed that if I just run python -m build --wheel . or python -m build --sdist there are no warnings.

Mr0grog commented 12 months ago

Well, this seems to happen for me on a more-or-less blank repo (e.g. what you have if you follow the PyPA tutorial), so I think it may be an intrinsic issue with the current combination of build, hatchling, and setuptools-scm. I’m definitely not familiar enough with the details to diagnose the problem well, but based on skimming the source:

  1. setuptools-scm looks like it walks up the directory tree looking for possible sources of data, one of which is .git_archival.txt. If one doesn’t work, it keeps going.
  2. It expects to be reading the version of git_archival.txt that has real values substituted in, i.e. the version that would be in an actual archive produce with git archive <commit-ish>.
  3. If it sees what looks like an un-substituted describe directive, it warns "git archive did not support describe output". Then if it looks like the whole file hasn’t been processed at all (i.e. it’s what’s in the git repo rather than in an archive) it warns "unprocessed git archival found (no export subst applied)", making the first warning a little redundant and confusing. This appears to be what we’re seeing.
  4. If you are looking at a live git repo, not an archive, it seems like this is probably fine, and is maybe expected operation — it warns, then continues on and eventually realizes it can read the tags directly from the .git database or by calling git or whatever. Possible other evidence that this is normal is that they set pytest to ignore this warning: https://github.com/pypa/setuptools_scm/blob/83d5e1479970a3ea559fe11553e9350ae694e91c/pyproject.toml#L120-L124

It seems like the above is how setuptools-scm has behaved since v7. Pinning setuptools-scm==6.4.2 (latest v6.x) in [build-system] seems to work and not show the error, but if this is expected, it's probably better to live with the error?

Alternatively, maybe build or hatchling is intending to work against an actual git archive instead of the working directory, but is somehow winding up on the actual working directory, or a copy of it instead?

Mr0grog commented 12 months ago

I’m also noticing setuptools_scm excludes the file in their own project’s MANIFEST.in: https://github.com/pypa/setuptools_scm/blob/83d5e1479970a3ea559fe11553e9350ae694e91c/MANIFEST.in#L4

Mr0grog commented 12 months ago

Also, they put one that does get included in src/setuptools_scm. If we do the same, there’s no warning. (Of course, it might not find it where it’s looking in an actual archive then, either? Seems weird.)

Anyway, I’m done for the day. Hopefully some of the above will be useful for you.

danielballan commented 12 months ago

I’m new to build, but is it correct that it’s making a second isolated environment and building a wheel even though it already built an sdist and a wheel?

Yes. From python -m build --help:

By default, a source distribution (sdist) is built from {srcdir}
and a binary distribution (wheel) is built from the sdist.
This is recommended as it will ensure the sdist can be used
to build wheels.

Pass -s/--sdist and/or -w/--wheel to build a specific distribution.
If you do this, the default behavior will be disabled, and all
artifacts will be built from {srcdir} (even if you combine
-w/--wheel with -s/--sdist, the wheel will be built from {srcdir}).

If you are looking at a live git repo, not an archive, it seems like this is probably fine, and is maybe expected operation.

This possibility occurred to me. Maybe this is fine....

danielballan commented 12 months ago

CI failures are related to cache key:

error computing cache key: template: cacheKey:1:35: executing "cacheKey" at <checksum "requirements.txt">: error calling checksum: open /home/circleci/wayback/requirements.txt: no such file or directory

and version not being set:

wayback/__init__.py:1: in <module>
    from ._version import __version__, __version_tuple__  # noqa: F401
E   ImportError: cannot import name '__version__'
=========================== short test summary info ============================
ERROR  - ImportError: cannot import name '__version__'

Locally, the version is being set as expected:

>>> import wayback
>>> wayback.__version__
'0.4.5.dev4+ga150311'
Mr0grog commented 12 months ago

I updated the cache key and build script in .circleci/config.yml, which mostly fixed all the CI failures. The one remaining thing is that check-wheel-contents is failing on some VCR cassette files that are duplicates:

Checking dist/wayback-0.4.5.dev7+g289355e.d20231203-py3-none-any.whl: PASSED
Checking dist/wayback-0.4.5.dev7+g289355e.d20231203.tar.gz: PASSED
dist/wayback-0.4.5.dev7+g289355e.d20231203-py3-none-any.whl: W002: Wheel contains duplicate files:
  wayback/tests/cassettes/test_get_memento_handles_non_utc_datetime.yaml
  wayback/tests/cassettes/test_get_memento_with_string_datetime.yaml

I think we didn’t have that error before because we excluded test files with:

[tool.setuptools.packages.find]
where = ["."]
exclude = ["docs", "tests"]

So this either needs the equivalent config for Hatchling or config for check-wheel-contents to skip the cassette files.

I think it’s probably better to continue not including test files, but I don’t have especially strong feelings about it. Either approach is probably OK if you have strong feelings the other way.

Mr0grog commented 12 months ago

I’m new to build, but is it correct that it’s making a second isolated environment and building a wheel even though it already built an sdist and a wheel?

Yes. From python -m build --help:

Yeah, I had read that, and was reacting to the text:

By default, a source distribution (sdist) is built from {srcdir} and a binary distribution (wheel) is built from the sdist.

While the log output made it look like it was building once from the sdist and then again from source:

* Building wheel from sdist
...
* Building wheel...

But I just started digging into the warning on a simplified repo using setuptools instead of hatchling, and setuptools does a lot more logging — I can see now that the * Building wheel... log line is part of the * Building wheel from sdist code path, not a second, separate build, which is what I had thought it was before.

Mr0grog commented 12 months ago

That also makes it clearer what’s causing the warning, and why we only see it when doing build . and not during build --sdist or build --wheel:

  1. Build builds an sdist from the working directory. setuptools-scm finds the version from git — no problem, no warnings.

  2. To build the wheel, build unpacks the sdist into a temp directory, and then builds a wheel. When it tiggers setuptools-scm on the unpacked sdist, it’s not in a git checkout anymore (so can’t get info from git) so it looks at .git_archival.txt. But since the sdist was not packed up by doing git archive, that file doesn’t have any git metadata in it, and so setuptools-scm logs the warning.

  3. Running python -m build --wheel doesn’t have this problem, because that builds the wheel from the working git directory instead of from the sdist.

Anyway, this makes sense now, although it is kind of a crappy situation. Do you know if there’s a way to have build (or hatchling?) run the right git machinery to get the right values substituted into .git_archival.txt when building the sdist? It seems like that’s the right fix here, since you’d theoretically want that file correctly populated in the sdist as if it were an actual git archive.

Mr0grog commented 12 months ago

Or alternatively: maybe the right way to do this is to build with:

# From the root of the repo:
git archive --output ../build_dir/source_archive.tar <tag-to-build>
cd ../build_dir
tar -xf source_archive.tar
rm source_archive.tar
python -m build .

That way you are building the sdist with a properly archived/archive-ready file tree.

Mr0grog commented 12 months ago

Re: “exclude the file from builds,” I tripped over this pyproject.toml snippet today while reading about something else, and I think it’s what we’d want to use here:

[tool.hatch.build.targets.sdist]
exclude = [".git_archival.txt"]

(The same article also suggests removing test files from wheels but not sdists, which I hadn’t thought of before. Interesting.)

danielballan commented 12 months ago

Let me a little digging around some other projects and the behavior of their builds. I wonder if (i) is better, though I agree you have found the right implementation of (ii) if we go that way.

someday I’d like to script the whole release process in CI, but I think that’s out of scope here

Happy to tackle this next.

I usually include the tests. The “web deployment” side of Python, interested in small containers and fast CD, pushes to keep the tests out, while the “science” side of Python, where end users also sometimes dip into light development, almost always bundles them.

The sdist/wheel breakdown is a reasonable compromise.

I am on international travel for work; I left a note to resume this Monday when I am back at my keyboard.

Mr0grog commented 11 months ago

That all makes sense to me!

I am on international travel for work; I left a note to resume this Monday when I am back at my keyboard.

Sounds good. I’m hoping to include this in a v0.5.0 release before the end of the year, but otherwise there’s no rush. (Besides this I want to do some final rate limiting cleanup for #137 and to finally get thread safety by dropping requests, but we’ll see; I may be being to hopeful there.)

someday I’d like to script the whole release process in CI, but I think that’s out of scope here

Happy to tackle this next.

👍

danielballan commented 11 months ago

Reading a bit, I agree that excluding .git_archival.txt from the bdist is the best way to resolve these warnings. Commits added.

Googling the current error in CircleCI, I found a post from one @Mr0grog: https://discuss.circleci.com/t/upgrading-to-the-convenience-image-breaks-everything/42746

Do we need to further adjust our cache key here?

Mr0grog commented 11 months ago

Googling the current error in CircleCI, I found a post from one @Mr0grog: https://discuss.circleci.com/t/upgrading-to-the-convenience-image-breaks-everything/42746

Do we need to further adjust our cache key here?

🤣 Yep, I think that’s probably all you need to do.

danielballan commented 11 months ago

Last issue:

dist/wayback-0.4.5.dev12+gcc638d8-py3-none-any.whl: W002: Wheel contains duplicate files:
  wayback/tests/cassettes/test_get_memento_handles_non_utc_datetime.yaml
  wayback/tests/cassettes/test_get_memento_with_string_datetime.yaml
danielballan commented 11 months ago

Indeed these files have the same contents, but I think we should not assume that they always will forever, so the right thing here to make an exemption in the check:

$ md5sum wayback/tests/cassettes/test_get_memento_handles_non_utc_datetime.yaml wayback/tests/cassettes/test_get_memento_with_string_datetime.yaml
1ede1c2e263c5b56731d1c29156f9784  wayback/tests/cassettes/test_get_memento_handles_non_utc_datetime.yaml
1ede1c2e263c5b56731d1c29156f9784  wayback/tests/cassettes/test_get_memento_with_string_datetime.yaml
Mr0grog commented 11 months ago

Huh, I would have thought that would be solved by https://github.com/edgi-govdata-archiving/wayback/commit/7be22d4b6c768195f2b9913267d917bcf1082119. Is that only addressing Python files, or does it need to be wayback/tests/**/*?

Mr0grog commented 11 months ago

Oh, no, I think it needs to be the wheel target:

[tool.hatch.build.targets.wheel]
danielballan commented 11 months ago

Ah, indeed. Pushed.

Mr0grog commented 11 months ago

Oh, I also wanted to ask: what pushed you towards excluding .git_archival.txt rather than building the sdist from an archive?

I agree that excluding .git_archival.txt from the bdist is the best way to resolve these warnings.

danielballan commented 11 months ago

Your argument made sense, I just look around a bit at some projects and confirmed that I could not find any examples that could convince me otherwise.

Mr0grog commented 11 months ago

@danielballan would a good quick followup here be moving /wayback to /src/wayback?