apertium / lttoolbox

Finite state compiler, processor and helper tools used by apertium
http://wiki.apertium.org/wiki/Lttoolbox
GNU General Public License v2.0
18 stars 22 forks source link

redo internals of State #185

Open mr-martian opened 4 months ago

mr-martian commented 4 months ago

This PR implements and alternative to State that builds a graph of output paths rather than copying a collection of vectors. The result is that step() is faster because it's not doing any allocations and filterFinals() is slower because it has to collect the path rather than just iterating over an array. Initial experimentation suggests this is an overall speedup, but the constants in resuable_state.h should probably be tuned empirically.

I currently have it as a separate class rather than directly modifying State in case there ends up being any ABI breakage or if this turns out to be better for some processes but not others.

mr-martian commented 3 months ago

Having now implemented this for analysis, I get the following results:

metric old -oci new -oci old -eng new -eng
input words 95K 95K 68K 68K
memory 440 MB 439 MB 34 MB 33 MB
time 2.1 s 1.7 s 0.6 s 0.4 s
speedup 19% 33%
mr-martian commented 3 months ago

Though the tests pass, this might have unexpected effects in various places because analyses with the same weight will now be displayed in alphabetical order rather than in an unspecified order strongly influenced by the order they appear in the dictionary. I can probably restore the older order, but I'm not sure whether that matters.

mr-martian commented 3 months ago

Restoring the old ordering is just

diff --git a/lttoolbox/reusable_state.cc b/lttoolbox/reusable_state.cc
index aa55c95..1126d71 100644
--- a/lttoolbox/reusable_state.cc
+++ b/lttoolbox/reusable_state.cc
@@ -241,11 +241,17 @@ void ReusableState::extract(size_t pos, UString& result, double& weight,
   }
 }

+bool comp_pair(const std::pair<double, UString>& a,
+               const std::pair<double, UString>& b)
+{
+  return a.first < b.first;
+}
+
 void NFinals(std::vector<std::pair<double, UString>>& results,
              size_t maxAnalyses, size_t maxWeightClasses)
 {
   if (results.empty()) return;
-  sort(results.begin(), results.end());
+  sort(results.begin(), results.end(), comp_pair);
   if (maxAnalyses < results.size()) {
     results.erase(results.begin()+maxAnalyses, results.end());
   }

which has a negligible impact on the runtime. I leave the question open to bikeshedding.

mr-martian commented 3 months ago

Having tried block sizes of 1<<N for N=1..8,16, my tentative conclusion is that changing it doesn't meaningfully alter the runtime numbers listed above.

mr-martian commented 2 months ago

I restored the original output ordering. I haven't applied it everywhere it could be applied, though that's largely because biltrans is structured in a way that makes it awkward to incorporate and the other modes have no tests.

mr-martian commented 2 months ago

As it stands now, this PR has subtly different capitalization output for generation, so I'll revert those parts.

--- oci-fra.prev.txt    2024-05-09 10:18:41.005392821 -0400
+++ oci-fra.cur.txt 2024-09-05 08:45:12.849808793 -0400
@@ -216 +216 @@
 00218. Les 44 objets d'art en litige sont trois caisses *sepulcralas du [-XVe-] {+XVE+} siècle et quatre fragments d'un retable d'*alabastre baroque, 11 fragments de ce retable, un banc et diverses peintures et sculptures des [-XVIe-] {+XVIE+} et [-XVIIIe-] {+XVIIIE+} siècles.
@@ -775 +775 @@
 00777. Des archéologues égyptiens annoncèrent mercredi dernier, 19 avril, la découverte d'un tombeau important à *Luxor de l'époque de la [-XVIIIe Dynastie,-] {+XVIIIE DYNASTIE,+} entre les années 1500 et 1295 avant le Christ, et qu'il appartenait probablement à un éminent juge de l'époque. 
@@ -1664 +1664 @@
 01666. L'aventure de Hicks et Bullard suit les passages des récits enregistrés au [-XVIIe-] {+XVIIE+} siècle sur l'arrivée de quelques *navigaires inconnus en Écosse.
@@ -2111 +2111 @@
 02120. *Vaduda, *ce se *pareish, en Provence, la tradition des petits personnages en argile peinte —les célèbres “santons”— que s'*estableish à la fin du [-XVIIIe-] {+XVIIIE+} siècle avec le fabricant marseillais J.-L.
@@ -2458 +2458 @@
 02467. La ville fut redécouverte par des voyageurs autour des [-XVIIe-] {+XVIIE+} et [-XVIIIe-] {+XVIIIE+} siècles.
@@ -2742 +2742 @@
 02751. À la capitale du Népal, Katmandou, d'édifices se sont éreintés. Entre les constructions affectées il y a la fameuse Tor de *Dharahara, du [-XIXe-] {+XIXE+} siècle, reconnue par son importance architecturale pour l'UNESCO: là-bas une cinquantaine de personnes sont restées bloquées sous les *tarcums. 
@@ -3556 +3556 @@
 03565. Jamais il ne se dénoncera assez l'*ignomínia du dernier génocide du [-XXe-] {+XXE+} siècle.
@@ -3889 +3889 @@
 03898. *Vaduda, *ce *disen, en Provence, la tradition des petits personnages en argile peinte —les célèbres santons— que s'*estableish à la fin du [-XVIIIe-] {+XVIIIE+} siècle avec le fabricant marseillais J.L.

--- fra-oci.prev.txt    2024-05-09 10:19:25.818425542 -0400
+++ fra-oci.cur.txt 2024-09-05 08:46:23.948983165 -0400
@@ -2759 +2759 @@
 02759. Lo març de 1838 un tractat es passat entre la societat e [-sénhers-] {+SÉNHERS+}
@@ -3447 +3447 @@
 03447. Lo 7 de junh de 2006, al sèti de BNP [-Paribas,-] {+PARIBAS,+} recep lo prèmi *Vernimmen, distincion creada en 1998 en omenatge a Pierre *Vernimmen, professor e banquièr francés famós per sas òbras sus la finança d'entrepresa.
mr-martian commented 2 months ago

turns out I actually just forgot to check the dirty bit

mr-martian commented 2 months ago

Unfortunately, the memory usage on this explodes for nno-nob, probably due to regexes (see #167).