My plan for testing was to pick two real commits in real projects and compare the alignment to what I expect it to be, but after reading test_build_cache, I actually still don't know how to populate a CoqProjectBuildCache with a project so that I can get two ProjectCommitData entries. Is it worth it to move a project through the whole pipeline, or should I just build the ProjectCommitData objects directly?
Sorry, just noticed your message. The script at https://github.com/a-gardner1/coq-pearls/blob/main/scripts/data/extract_cache.py can be invoked with a subset of projects (--project-names) and a limited number of commits (--max-num-commits). A good candidate project is probably lambda or float, which we use in some of our unit tests. However, I do not think a time-efficient unit test for extraction could be made for alignment over two whole commits without having some ProjectCommitData precomputed.
Instead, I would follow the recipe of extract_cache_profile.py using the files_to_use argument and limit the extraction to a small subset of files. For an actual unit test, I would also have the ProjectCommitData precomputed and serialized in a file alongside the test module. That way the test is independent of any quirks in the actual extraction logic and doesn't incur the associated overhead.
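A minimal sketch of what such a test might look like (the fixture file names, the seutil load call, and the import paths are assumptions for illustration, not actual project code):

import unittest
from pathlib import Path

import seutil as se

# Hypothetical import paths; the real module layout may differ.
from prism.data.build_cache import ProjectCommitData
from prism.data.align import align_commits

_DATA_DIR = Path(__file__).parent

class TestAlignCommits(unittest.TestCase):

    def test_align_two_commits(self):
        # Load two precomputed, serialized ProjectCommitData fixtures so the
        # test neither depends on nor pays for the full extraction pipeline.
        # (se.io.load with a clz argument is an assumed seutil API.)
        a = se.io.load(_DATA_DIR / "commit_a.yml", clz=ProjectCommitData)
        b = se.io.load(_DATA_DIR / "commit_b.yml", clz=ProjectCommitData)
        # Compare against an expected alignment computed once by hand.
        expected = se.io.load(_DATA_DIR / "expected_alignment.yml")
        self.assertEqual(align_commits(a, b), expected)

if __name__ == "__main__":
    unittest.main()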
I got this error after trying to run the extraction script invoked as such:
python scripts/data/extract_cache.py --project-names float --max-num-commits 30 --root "/home/tpr/work/ra/PRISM"
where I set up the directory format so that the dataset directory in the repo aligns with the default path.
/home/tpr/work/ra/PRISM/prism/pearls/lib/python3.8/site-packages/seutil/../prism/data/commit_map.py:376: UserWarning: No results found for float
warnings.warn(f"No results found for {p.name}")
<class 'prism.language.gallina.exception.SexpAnalyzingException'> encountered in project float:
Traceback (most recent call last):
File "/home/tpr/work/ra/PRISM/prism/pearls/lib/python3.8/site-packages/seutil/../
prism/language/gallina/analyze.py", line 894, in analyze_vernac
control_flag = ControlFlag(vernac_control[0].content)
File "/usr/lib/python3.8/enum.py", line 339, in __call__
return cls.__new__(cls, value)
File "/usr/lib/python3.8/enum.py", line 663, in __new__
raise ve_exc
ValueError: 'loc' is not a valid ControlFlag
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/home/tpr/work/ra/coq-pearls/lib/python3.8/site-packages/seutil/../prism/data/commit_map.py", line 104, in _project_commit_fmap
result = commit_fmap(project, commit, result)
File "/home/tpr/work/ra/coq-pearls/lib/python3.8/site-packages/seutil/../prism/data/commit_map.py", line 324, in _commit_fmap_and_update
result = commit_fmap(project, commit_sha, result)
File "/home/tpr/work/ra/coq-pearls/lib/python3.8/site-packages/prism/data/extract_cache.py", line 1150, in extract_cache_func
extract_cache(
File "/home/tpr/work/ra/coq-pearls/lib/python3.8/site-packages/prism/data/extract_cache.py", line 653, in extract_cache
extract_cache_new(
File "/home/tpr/work/ra/coq-pearls/lib/python3.8/site-packages/prism/data/extract_cache.py", line 720, in extract_cache_new
command_data = extract_vernac_commands(project, files_to_use)
File "/home/tpr/work/ra/coq-pearls/lib/python3.8/site-packages/prism/data/extract_cache.py", line 563, in extract_vernac_commands
command_data[filename] = _extract_vernac_commands(
File "/home/tpr/work/ra/coq-pearls/lib/python3.8/site-packages/prism/data/extract_cache.py", line 398, in _extract_vernac_commands
vernac = SexpAnalyzer.analyze_vernac(sentence.ast)
File "/home/tpr/work/ra/coq-pearls/lib/python3.8/site-packages/seutil/../prism/language/gallina/analyze.py", line 896, in analyze_vernac
raise SexpAnalyzingException(sexp) from e
prism.language.gallina.exception.SexpAnalyzingException:
in sexp: (loc (((fname ToplevelInput) (line_nb 1) (bol_pos 0) (line_nb_last 1) (bol_pos_last 0) (bp 0) (ep 31))))
Done
There's a warning about not being able to find float, but then what was it extracting?
The warning just means that nothing was extracted. That is a new error... at first glance it implies that the entire AST for some command was just a location. We will investigate it later today.
I've got an MR waiting to merge that fixes a couple of bugs, though I was unable to reproduce your error. One bug was pretty significant and might have caused yours: the switch was not being passed to SerAPI. I was able to produce the cache for one commit of float, which came out to 313 MB (12 MB compressed).
I think the size of the cache can be reduced (e.g., by storing diffs of proof goals versus complete sets for each sentence), but I am not sure to what extent.
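A toy illustration of the goal-diff idea (hypothetical names, not the actual cache schema): store, for each sentence, only the goals that changed relative to the previous sentence, and reconstruct the full goal list on demand.

from typing import Dict, List

def diff_goals(prev: List[str], curr: List[str]) -> Dict[int, str]:
    """Return {index: goal} for positions where the goal list changed."""
    delta = {i: g for i, (p, g) in enumerate(zip(prev, curr)) if p != g}
    # Goals appended beyond the previous length are also part of the delta.
    delta.update({i: g for i, g in enumerate(curr) if i >= len(prev)})
    return delta

def apply_goals(prev: List[str], delta: Dict[int, str], length: int) -> List[str]:
    """Reconstruct the goal list from the previous one, its delta, and the
    number of goals after the sentence (needed when goals are discharged)."""
    curr = (prev + [""] * length)[:length]
    for i, g in delta.items():
        curr[i] = g
    return curr

Since consecutive sentences usually share most of their goals, the per-sentence deltas should be much smaller than the complete goal sets.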
The changes have been merged. Can you try again? And if you hit the same error, see if you can determine which file and commit caused it. I recommend adding --force-serial for debugging.
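For example, something like (reusing the flags from the earlier invocation):
python scripts/data/extract_cache.py --project-names float --max-num-commits 1 --root "/home/tpr/work/ra/PRISM" --force-serial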
That looks like it fixed it. It's making progress now.
Though, it used up all 32 GB of RAM on my computer. Then I gave it 10 GB of swap and it still got OOM-killed a while later, just as it was finishing up the first commit. I probably have to do as extract_cache_profile does and pick a couple of files to align.
Oof, looks like some more profiling is in order. I'm sure we can reduce the memory usage.
I have pushed a fix for a memory leak due to some missing Py_DecRef calls in the C extension for sexp parsing. I've confirmed via memory profiling that there are no more leaks.
I also have a couple of changes in mind for SerAPI and the cached data representations that ought to significantly reduce both memory usage and time to completion (if the code is currently structured the way I think it is in my head, then I expect to nearly double the overall speed).
Here's another fun one.
Use a default-commit of e5f4b46405b24efb4ec9956e486df2b87adb10b2 for verdi-chord.
python scripts/data/extract_cache.py --project-names verdi-chord --max-num-commits 1 --root "/home/tpr/work/ra/PRISM" --n-build-workers 15
File "/usr/lib/python3.8/concurrent/futures/process.py", line 239, in _process_worker
r = call_item.fn(*call_item.args, **call_item.kwargs)
File "/home/tpr/work/ra/PRISM/prism/pearls/lib/python3.8/site-packages/seutil/../prism/data/commit_map.py", line 117, in _project_commit_fmap_
return _project_commit_fmap(*args)
File "/home/tpr/work/ra/PRISM/prism/pearls/lib/python3.8/site-packages/seutil/../prism/data/commit_map.py", line 99, in _project_commit_fmap
for commit in pbar:
File "/home/tpr/work/ra/PRISM/prism/pearls/lib/python3.8/site-packages/tqdm/std.py", line 1195, in __iter__
for obj in iterable:
File "/home/tpr/work/ra/PRISM/prism/pearls/lib/python3.8/site-packages/prism/data/extract_cache.py", line 753, in cache_extract_commit_iterator
iterator = ChangedCoqCommitIterator(project, starting_commit_sha)
File "/home/tpr/work/ra/PRISM/prism/pearls/lib/python3.8/site-packages/seutil/../prism/project/repo.py", line 82, in __init__
self._repo_initial_head = repo.commit(commit_sha)
File "/home/tpr/work/ra/PRISM/prism/pearls/lib/python3.8/site-packages/seutil/../git/repo/base.py", line 590, in commit
return self.rev_parse(str(rev) + "^0")
File "/home/tpr/work/ra/PRISM/prism/pearls/lib/python3.8/site-packages/seutil/../git/repo/fun.py", line 249, in rev_parse
obj = cast(Commit_ish, name_to_object(repo, rev[:start]))
File "/home/tpr/work/ra/PRISM/prism/pearls/lib/python3.8/site-packages/seutil/../git/repo/fun.py", line 186, in name_to_object
return Object.new_from_sha(repo, hex_to_bin(hexsha))
File "/home/tpr/work/ra/PRISM/prism/pearls/lib/python3.8/site-packages/seutil/../git/objects/base.py", line 94, in new_from_sha
oinfo = repo.odb.info(sha1)
File "/home/tpr/work/ra/PRISM/prism/pearls/lib/python3.8/site-packages/seutil/../git/db.py", line 40, in info
hexsha, typename, size = self._git.get_object_header(bin_to_hex(binsha))
File "/home/tpr/work/ra/PRISM/prism/pearls/lib/python3.8/site-packages/seutil/../git/cmd.py", line 1338, in get_object_header
return self.__get_object_header(cmd, ref)
File "/home/tpr/work/ra/PRISM/prism/pearls/lib/python3.8/site-packages/seutil/../git/cmd.py", line 1325, in __get_object_header
return self._parse_object_header(cmd.stdout.readline())
File "/home/tpr/work/ra/PRISM/prism/pearls/lib/python3.8/site-packages/seutil/../git/cmd.py", line 1286, in _parse_object_header
raise ValueError("SHA %s could not be resolved, git returned: %r" % (tokens[0], header_line.strip()))
ValueError: SHA b'e5f4b46405b24efb4ec9956e486df2b87adb10b2' could not be resolved, git returned: b'e5f4b46405b24efb4ec9956e486df2b87adb10b2 missing'
I thought this was the 2019 filter, but I edited that to 2010 and re-pip-installed, and this still happened.
I can manually clone the verdi repo and check out that hash without issues, but this still happens. This error didn't occur before I changed the default-commit.
This could be, or be related to, this upstream issue: https://github.com/gitpython-developers/GitPython/issues/1016
In the meantime I might just copy paste some definitions from some files of interest for a synthetic test.
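As a sanity check, a snippet like this (hypothetical; GitPython's commit lookup raises the ValueError shown in the traceback) can confirm whether a hash actually resolves in the exact clone the pipeline is using:

import git  # GitPython

def sha_resolves(repo_path: str, sha: str) -> bool:
    """Return True if sha names an object in the given local clone."""
    repo = git.Repo(repo_path)
    try:
        repo.commit(sha)  # raises ValueError when the object is missing
        return True
    except ValueError:
        return False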
OK, added a test by picking an interesting pair of commits from a project. I did some hacky stuff to include definitions in the alignment because those were deleted/added between the commits I picked. Made sure to include the license of the repository I took the examples from. Still needs a test of multiple files.
With this new test, documentation, and accompanying bugfix for the new test, I think this is now in a reasonable state for someone to take a look at.
@tom-p-reichel I'm not seeing a commit with that hash in the verdi-chord repository. https://github.com/DistributedComponents/verdi-chord/commit/e5f4b46405b24efb4ec9956e486df2b87adb10b2 gives me a 404. I thought it might be a blob hash in that repo, but it doesn't seem to be that either. Where did that hash come from?
Summary of offline conversation: we think the commit iterators might need to be modified to not follow chains of parents or children, as this hash likely arose as the parent of some commit that no longer exists after a rebase or other history-altering modification to the repository. Instead, we will just have to get the list of all commits in the repository up front, sort them by date, and iterate over the list.
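A sketch of that alternative strategy (hypothetical code using GitPython): enumerate every commit reachable from any ref up front and sort by commit date, instead of following parent links from a single starting hash.

import git

def commits_by_date(repo_path: str):
    """All commits reachable from any ref, oldest first."""
    repo = git.Repo(repo_path)
    return sorted(repo.iter_commits("--all"), key=lambda c: c.committed_date)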
Oh no! Don't do that! No, I think what happened is that there are actually two popular coq repos named "verdi"-- one by DistributedComponents and one by uwplse-- and I had been resolving names to github repos using google with no problem up until this one, which was ambiguous-- my fault.
So that hash literally didn't exist in the repo; it existed in another coq project named "verdi", which I didn't expect to exist.
Implemented the requested stub-- also found out that I was wrong about which edit distance I was using. Hamming distance only applies to strings of equal length; what I was actually using is Levenshtein.
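For reference, a minimal illustration of the distinction (generic code, not the project's implementation): Hamming counts position-wise substitutions between equal-length strings, while Levenshtein also allows insertions and deletions.

def hamming(a: str, b: str) -> int:
    """Hamming distance: substitutions only, so inputs must have equal length."""
    if len(a) != len(b):
        raise ValueError("Hamming distance requires equal-length strings")
    return sum(x != y for x, y in zip(a, b))

# hamming("flaw", "lawn") == 4 position-wise, while the Levenshtein distance
# between the same strings is 2 (delete the leading 'f', append 'n').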
Got some non-trivial performance improvements by running alignment on numpy arrays of characters as ints instead of strings. Kind of thought numba would do that automatically.
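A generic sketch of the technique (not the actual fast_edit_distance code): encode each string once as an array of integer code points so the jitted kernel compares machine integers rather than Python string elements.

import numba
import numpy as np

def to_codepoints(s: str) -> np.ndarray:
    """Encode a string as a writable uint32 array of Unicode code points."""
    return np.frombuffer(s.encode("utf-32-le"), dtype=np.uint32).copy()

@numba.njit(cache=True)
def levenshtein(a: np.ndarray, b: np.ndarray) -> int:
    """Classic two-row Levenshtein dynamic program over integer arrays."""
    prev = np.arange(len(b) + 1)
    curr = np.empty_like(prev)
    for i in range(1, len(a) + 1):
        curr[0] = i
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            curr[j] = min(prev[j] + 1, min(curr[j - 1] + 1, prev[j - 1] + cost))
        prev, curr = curr, prev  # reuse the two row buffers
    return prev[len(b)]

assert levenshtein(to_codepoints("flaw"), to_codepoints("lawn")) == 2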
FYI, I am going to replace usages of fast_edit_distance with a third-party implementation of the Levenshtein distance that is considerably faster (timeit runs its statement one million times by default, so these figures are roughly 0.25 µs versus 32 µs per call):
>>> from timeit import timeit
>>> import leven
>>> from prism.util.alignment import fast_edit_distance
>>> fast_edit_distance("flaw", "lawn", True)[0] # eliminate jit compilation from timeit
2.0
>>> timeit(lambda : leven.levenshtein("flaw", "lawn"))
0.2536887312307954
>>> timeit(lambda : fast_edit_distance("flaw", "lawn", True)[0])
31.69242916069925
The changes in this PR have been rebased and merged with the main branch.
Numba alignment stuff works, shimmed into existing tests for alignment.
I did some testing by hand on file_alignment, but I need to add some rigorous testing of align_commits before this should be merged-- PR sent now for visibility.