Radiance-Technologies / prism

Utilities for creating and curating a dataset for automated proof repair in the Coq Proof Assistant.
GNU Lesser General Public License v3.0

Accelerated Alignment Code + Adapter for ProjectCommitData #32

Closed: tom-p-reichel closed this 1 year ago

tom-p-reichel commented 2 years ago

The Numba-accelerated alignment works and is shimmed into the existing alignment tests.

I did some testing by hand on file_alignment, but I need to add rigorous tests for align_commits before this should be merged-- PR sent now for visibility.

tom-p-reichel commented 2 years ago

My plan for testing was to pick two real commits in real projects and compare the alignment to what I expect it to be, but after reading test_build_cache, I still don't know how to populate a CoqProjectBuildCache with a project so that I can get two ProjectCommitData entries. Is it worth moving a project through the whole pipeline, or should I just build the ProjectCommitData objects directly?

a-gardner1 commented 2 years ago

Sorry, just noticed your message. The script at https://github.com/a-gardner1/coq-pearls/blob/main/scripts/data/extract_cache.py can be invoked with a subset of projects (--project-names) and a limited number of commits (--max-num-commits). A good candidate project is probably lambda or float, which we use in some of our unit tests. However, I do not think a time-efficient unit test could be made for alignment over two whole commits without having some ProjectCommitData precomputed.

Instead, I would follow the recipe of extract_cache_profile.py, using the files_to_use argument to limit the extraction to a small subset of files. For an actual unit test, I would also have the ProjectCommitData precomputed and serialized in a file alongside the test module. That way the test is independent of any quirks in the actual extraction logic and doesn't incur the associated overhead.
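
For illustration, here is a minimal sketch of such a test. The module paths, file names, and the load helper are hypothetical stand-ins, not necessarily prism's actual API:

import unittest
from pathlib import Path

# Assumed import locations; adjust to wherever ProjectCommitData and
# align_commits actually live.
from prism.data.build_cache import ProjectCommitData
from prism.data.repair.align import align_commits

TEST_DIR = Path(__file__).parent

class TestAlignCommits(unittest.TestCase):

    def test_align_commits(self) -> None:
        # Precomputed, serialized commit data checked in alongside this
        # test module, so no extraction runs at test time.
        a = ProjectCommitData.load(TEST_DIR / "commit_a.yml")  # hypothetical loader
        b = ProjectCommitData.load(TEST_DIR / "commit_b.yml")
        alignment = align_commits(a, b)
        # Compare against a hand-verified expected alignment.
        self.assertIsNotNone(alignment)

if __name__ == "__main__":
    unittest.main()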

tom-p-reichel commented 2 years ago

I got this error after trying to run the extraction script, invoked as follows: python scripts/data/extract_cache.py --project-names float --max-num-commits 30 --root "/home/tpr/work/ra/PRISM"

where I set up the directory structure so that the dataset directory in the repo matches the default path.

/home/tpr/work/ra/PRISM/prism/pearls/lib/python3.8/site-packages/seutil/../prism/data/commit_map.py:376: UserWarning: No results found for float
  warnings.warn(f"No results found for {p.name}")
<class 'prism.language.gallina.exception.SexpAnalyzingException'> encountered in project float:
Traceback (most recent call last):
  File "/home/tpr/work/ra/PRISM/prism/pearls/lib/python3.8/site-packages/seutil/../prism/language/gallina/analyze.py", line 894, in analyze_vernac
    control_flag = ControlFlag(vernac_control[0].content)
  File "/usr/lib/python3.8/enum.py", line 339, in __call__
    return cls.__new__(cls, value)
  File "/usr/lib/python3.8/enum.py", line 663, in __new__
    raise ve_exc
ValueError: 'loc' is not a valid ControlFlag

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/tpr/work/ra/coq-pearls/lib/python3.8/site-packages/seutil/../prism/data/commit_map.py", line 104, in _project_commit_fmap
    result = commit_fmap(project, commit, result)
  File "/home/tpr/work/ra/coq-pearls/lib/python3.8/site-packages/seutil/../prism/data/commit_map.py", line 324, in _commit_fmap_and_update
    result = commit_fmap(project, commit_sha, result)
  File "/home/tpr/work/ra/coq-pearls/lib/python3.8/site-packages/prism/data/extract_cache.py", line 1150, in extract_cache_func
    extract_cache(
  File "/home/tpr/work/ra/coq-pearls/lib/python3.8/site-packages/prism/data/extract_cache.py", line 653, in extract_cache
    extract_cache_new(
  File "/home/tpr/work/ra/coq-pearls/lib/python3.8/site-packages/prism/data/extract_cache.py", line 720, in extract_cache_new
    command_data = extract_vernac_commands(project, files_to_use)
  File "/home/tpr/work/ra/coq-pearls/lib/python3.8/site-packages/prism/data/extract_cache.py", line 563, in extract_vernac_commands
    command_data[filename] = _extract_vernac_commands(
  File "/home/tpr/work/ra/coq-pearls/lib/python3.8/site-packages/prism/data/extract_cache.py", line 398, in _extract_vernac_commands
    vernac = SexpAnalyzer.analyze_vernac(sentence.ast)
  File "/home/tpr/work/ra/coq-pearls/lib/python3.8/site-packages/seutil/../prism/language/gallina/analyze.py", line 896, in analyze_vernac
    raise SexpAnalyzingException(sexp) from e
prism.language.gallina.exception.SexpAnalyzingException: 
in sexp: (loc 
  (
    (
      (fname ToplevelInput) 
      (line_nb 1) 
      (bol_pos 0) 
      (line_nb_last 1) 
      (bol_pos_last 0) 
      (bp 0) 
      (ep 31))))

Done

There's a warning about not being able to find float, but then what was it extracting?

a-gardner1 commented 2 years ago

I got this error after trying to run the extraction script, invoked as follows: python scripts/data/extract_cache.py --project-names float --max-num-commits 30 --root "/home/tpr/work/ra/PRISM"

[Clipped]

There's a warning about not being able to find float, but then what was it extracting?

The warning just means that nothing was extracted. That is a new error... at first glance, it implies that the entire AST for some command was just a location. We will investigate later today.

a-gardner1 commented 2 years ago

I've got an MR waiting to merge that fixes a couple of bugs, though I was unable to reproduce your error. One bug was pretty significant and might have caused yours: the switch was not being passed to SerAPI. I was able to produce the cache for one commit of float, which came out to 313 MB (12 MB compressed).

I think the size of the cache can be reduced (e.g., by storing diffs of proof goals rather than the complete set for each sentence), but I'm not sure to what extent.
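
As a toy illustration of the diff idea (not prism's actual cache representation), storing only per-sentence goal deltas lets the full goal set be reconstructed on demand:

def goal_diff(prev_goals: frozenset, curr_goals: frozenset):
    # Store only what changed between consecutive sentences.
    added = curr_goals - prev_goals
    removed = prev_goals - curr_goals
    return added, removed

def reconstruct(prev_goals: frozenset, added: frozenset, removed: frozenset) -> frozenset:
    # Replay a delta to recover the full goal set for a sentence.
    return (prev_goals - removed) | added

g0 = frozenset({r"A /\ B", "C"})
g1 = frozenset({"A", "B", "C"})  # e.g., after a `split.`
assert reconstruct(g0, *goal_diff(g0, g1)) == g1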

a-gardner1 commented 2 years ago

The changes have been merged. Can you try again? And if you hit the same error, see if you can determine which file and commit caused it. I recommend adding --force-serial for debugging.

tom-p-reichel commented 2 years ago

That looks like it fixed it. It's making progress now.

tom-p-reichel commented 2 years ago

Though, it used up all 32 GB of RAM on my computer. Then I gave it 10 GB of swap and it still got OOM-killed a while later, just as it was finishing up the first commit. I probably have to do as extract_cache_profile.py does and pick a couple of files to align.

a-gardner1 commented 2 years ago

Though, it used up all 32 GB of RAM on my computer. Then I gave it 10 GB of swap and it still got OOM-killed a while later, just as it was finishing up the first commit. I probably have to do as extract_cache_profile.py does and pick a couple of files to align.

Oof, looks like some more profiling is in order. I'm sure we can reduce the memory usage.

a-gardner1 commented 2 years ago

I have pushed a fix for a memory leak due to some missing Py_DECREF calls in the C extension for sexp parsing. I've confirmed via memory profiling that there are no more leaks.

I also have a couple of changes in mind for the SerAPI interaction and the cached data representations that ought to significantly reduce both memory usage and time to completion (if the code is currently structured the way I think it is in my head, then I expect to nearly double the overall speed).
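
For reference, this kind of leak check can be approximated from pure Python with tracemalloc; parse_file below is a hypothetical stand-in for the extension's parsing entry point:

import gc
import tracemalloc

def check_for_leak(parse_file, path: str, iterations: int = 100) -> None:
    parse_file(path)  # warm up so one-time allocations don't pollute the baseline
    gc.collect()
    tracemalloc.start()
    baseline = tracemalloc.take_snapshot()
    for _ in range(iterations):
        parse_file(path)
    gc.collect()
    after = tracemalloc.take_snapshot()
    # A refcount leak shows up as allocation growth proportional to `iterations`.
    for stat in after.compare_to(baseline, "lineno")[:10]:
        print(stat)
    tracemalloc.stop()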

tom-p-reichel commented 2 years ago

Here's another fun one.

Use a default-commit of e5f4b46405b24efb4ec9956e486df2b87adb10b2 for verdi-chord.

python scripts/data/extract_cache.py --project-names verdi-chord --max-num-commits 1 --root "/home/tpr/work/ra/PRISM" --n-build-workers 15

  File "/usr/lib/python3.8/concurrent/futures/process.py", line 239, in _process_worker
    r = call_item.fn(*call_item.args, **call_item.kwargs)
  File "/home/tpr/work/ra/PRISM/prism/pearls/lib/python3.8/site-packages/seutil/../prism/data/commit_map.py", line 117, in _project_commit_fmap_
    return _project_commit_fmap(*args)
  File "/home/tpr/work/ra/PRISM/prism/pearls/lib/python3.8/site-packages/seutil/../prism/data/commit_map.py", line 99, in _project_commit_fmap
    for commit in pbar:
  File "/home/tpr/work/ra/PRISM/prism/pearls/lib/python3.8/site-packages/tqdm/std.py", line 1195, in __iter__
    for obj in iterable:
  File "/home/tpr/work/ra/PRISM/prism/pearls/lib/python3.8/site-packages/prism/data/extract_cache.py", line 753, in cache_extract_commit_iterator
    iterator = ChangedCoqCommitIterator(project, starting_commit_sha)
  File "/home/tpr/work/ra/PRISM/prism/pearls/lib/python3.8/site-packages/seutil/../prism/project/repo.py", line 82, in __init__
    self._repo_initial_head = repo.commit(commit_sha)
  File "/home/tpr/work/ra/PRISM/prism/pearls/lib/python3.8/site-packages/seutil/../git/repo/base.py", line 590, in commit
    return self.rev_parse(str(rev) + "^0")
  File "/home/tpr/work/ra/PRISM/prism/pearls/lib/python3.8/site-packages/seutil/../git/repo/fun.py", line 249, in rev_parse
    obj = cast(Commit_ish, name_to_object(repo, rev[:start]))
  File "/home/tpr/work/ra/PRISM/prism/pearls/lib/python3.8/site-packages/seutil/../git/repo/fun.py", line 186, in name_to_object
    return Object.new_from_sha(repo, hex_to_bin(hexsha))
  File "/home/tpr/work/ra/PRISM/prism/pearls/lib/python3.8/site-packages/seutil/../git/objects/base.py", line 94, in new_from_sha
    oinfo = repo.odb.info(sha1)
  File "/home/tpr/work/ra/PRISM/prism/pearls/lib/python3.8/site-packages/seutil/../git/db.py", line 40, in info
    hexsha, typename, size = self._git.get_object_header(bin_to_hex(binsha))
  File "/home/tpr/work/ra/PRISM/prism/pearls/lib/python3.8/site-packages/seutil/../git/cmd.py", line 1338, in get_object_header
    return self.__get_object_header(cmd, ref)
  File "/home/tpr/work/ra/PRISM/prism/pearls/lib/python3.8/site-packages/seutil/../git/cmd.py", line 1325, in __get_object_header
    return self._parse_object_header(cmd.stdout.readline())
  File "/home/tpr/work/ra/PRISM/prism/pearls/lib/python3.8/site-packages/seutil/../git/cmd.py", line 1286, in _parse_object_header
    raise ValueError("SHA %s could not be resolved, git returned: %r" % (tokens[0], header_line.strip()))
ValueError: SHA b'e5f4b46405b24efb4ec9956e486df2b87adb10b2' could not be resolved, git returned: b'e5f4b46405b24efb4ec9956e486df2b87adb10b2 missing'

I thought this was the 2019 filter, but I edited that to 2010 and re-pip-installed, and this still happened.

I can manually clone the verdi repo and check out that hash without issues, but this still happens. This error didn't occur before I changed the default-commit.

This could be, or be related to, this upstream issue: https://github.com/gitpython-developers/GitPython/issues/1016

In the meantime I might just copy-paste some definitions from some files of interest for a synthetic test.

tom-p-reichel commented 2 years ago

OK, added a test by picking an interesting pair of commits from a project. I did some hacky stuff to include definitions in the alignment, because those were deleted/added between the commits I picked. I made sure to include the license of the repository I took the examples from. This still needs a test over multiple files.

tom-p-reichel commented 2 years ago

With this new test, documentation, and accompanying bugfix for the new test, I think this is now in a reasonable state for someone to take a look at.

rwhender commented 2 years ago

Here's another fun one.

Use a default-commit of e5f4b46405b24efb4ec9956e486df2b87adb10b2 for verdi-chord.

python scripts/data/extract_cache.py --project-names verdi-chord --max-num-commits 1 --root "/home/tpr/work/ra/PRISM" --n-build-workers 15

[Clipped]

I thought this was the 2019 filter, but I edited that to 2010 and re-pip-installed, and this still happened.

I can manually clone the verdi repo and check out that hash without issues, but this still happens. This error didn't occur before I changed the default-commit.

This could be, or be related to, this upstream issue: gitpython-developers/GitPython#1016

In the meantime I might just copy-paste some definitions from some files of interest for a synthetic test.

@tom-p-reichel I'm not seeing a commit with that hash in the verdi-chord repository. https://github.com/DistributedComponents/verdi-chord/commit/e5f4b46405b24efb4ec9956e486df2b87adb10b2 gives me a 404. I thought it might be a blob hash in that repo, but it doesn't seem to be that either. Where did that hash come from?

a-gardner1 commented 2 years ago

@tom-p-reichel I'm not seeing a commit with that hash in the verdi-chord repository. DistributedComponents/verdi-chord@e5f4b46 gives me a 404. I thought it might be a blob hash in that repo, but it doesn't seem to be that either. Where did that hash come from?

Summary of offline conversation: we think the commit iterators might need to be modified to not follow chains of parents or children, as this hash likely arose as the parent of some commit that no longer exists after a rebase or other history-altering modification to the repository. Instead, we will just have to get the list of all commits in the repository up front, sort them by date, and iterate over the list.
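
A minimal sketch of that proposed strategy with GitPython (enumerate all reachable commits up front, then walk them oldest-first):

from git import Repo

def commits_by_date(repo_path: str):
    # Enumerate every commit reachable from any ref, rather than walking
    # parent links from a starting SHA that may have been rebased away.
    repo = Repo(repo_path)
    commits = list(repo.iter_commits("--all"))
    return sorted(commits, key=lambda c: c.committed_date)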

tom-p-reichel commented 2 years ago

we think the commit iterators might need to be modified to not follow chains of parents or children, as this hash likely arose as the parent of some commit that no longer exists after a rebase or other history-altering modification to the repository

Oh no! Don't do that! No, I think what happened is that there are actually two popular Coq repos named "verdi"-- one by DistributedComponents and one by uwplse-- and I had been resolving names to GitHub repos using Google with no problem up until this one, which was ambiguous-- my fault.

So that hash literally didn't exist in the repo; it existed in another Coq project named "verdi", which I didn't expect to exist.

tom-p-reichel commented 2 years ago

Implemented the requested stub-- also found out that I was wrong about which edit distance I was using. Hamming distance only applies to strings of equal length; what I was actually using was Levenshtein distance.
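
For clarity, a quick illustration of the distinction:

def hamming(a: str, b: str) -> int:
    # Hamming distance counts positional mismatches and is only defined
    # when both strings have the same length.
    assert len(a) == len(b), "Hamming distance requires equal-length strings"
    return sum(x != y for x, y in zip(a, b))

print(hamming("flaw", "lawn"))  # 4: every aligned position differs
# Levenshtein("flaw", "lawn") is only 2: drop the leading 'f', append 'n'.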

tom-p-reichel commented 2 years ago

Got some non-trivial performance improvements by running alignment on NumPy arrays of characters as ints instead of strings. I had kind of assumed Numba would do that conversion automatically.
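
A self-contained sketch of that optimization (an illustration of the technique, not the PR's actual fast_edit_distance):

import numpy as np
from numba import njit

def to_codes(s: str) -> np.ndarray:
    # Encode the string as an array of Unicode code points so the jitted
    # kernel only ever touches machine integers.
    return np.frombuffer(s.encode("utf-32-le"), dtype=np.int32).copy()

@njit(cache=True)
def levenshtein(a: np.ndarray, b: np.ndarray) -> int:
    # Classic two-row dynamic program over integer arrays.
    prev = np.arange(b.size + 1)
    curr = np.empty(b.size + 1, dtype=np.int64)
    for i in range(a.size):
        curr[0] = i + 1
        for j in range(b.size):
            cost = 0 if a[i] == b[j] else 1
            curr[j + 1] = min(prev[j] + cost, prev[j + 1] + 1, curr[j] + 1)
        prev, curr = curr, prev
    return prev[b.size]

print(levenshtein(to_codes("flaw"), to_codes("lawn")))  # 2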

a-gardner1 commented 1 year ago

FYI, I am going to replace usages of fast_edit_distance with a third-party implementation of the Levenshtein distance that is considerably faster:

>>> from timeit import timeit
>>> import leven
>>> from prism.util.alignment import fast_edit_distance
>>> fast_edit_distance("flaw", "lawn", True)[0]  # eliminate jit compilation from timeit
2.0
>>> timeit(lambda : leven.levenshtein("flaw", "lawn"))
0.2536887312307954
>>> timeit(lambda : fast_edit_distance("flaw", "lawn", True)[0])
31.69242916069925

a-gardner1 commented 1 year ago

The changes in this PR have been rebased onto and merged into the main branch.