Genivia / ugrep

NEW ugrep 6.5: a more powerful, ultra fast, user-friendly, compatible grep. Includes a TUI, Google-like Boolean search with AND/OR/NOT, fuzzy search, hexdumps, searches (nested) archives (zip, 7z, tar, pax, cpio), compressed files (gz, Z, bz2, lzma, xz, lz4, zstd, brotli), pdfs, docs, and more
https://ugrep.com
BSD 3-Clause "New" or "Revised" License

ugrep.exe - Processing large files: RAM usage and speed. #153

Closed by zh76internetru 2 years ago

zh76internetru commented 2 years ago

Hello. I'm writing about how ugrep.exe works. I have a task: use a pattern file of a million lines to search for similar entries in another file of several million lines. ugrep.exe copes with this well, but I have some wishes.

  1. It does not handle such large files well - it uses a great deal of RAM. I had to split the pattern file into several files of 65,000 lines each, and the file being checked also had to be split.

In terms of size, the 2 files turned out to be 2.5 MB each. But while searching, they use 4 GB of RAM.

I had to write a bat file (as best I could - I'm not a programmer).

  2. The processing speed depends on the processor frequency: at 2.5 GHz the file is checked in 12 seconds, at 3 GHz in 8 seconds. Time is also spent loading and unloading files into RAM, which greatly slows down the overall process of checking the 2 databases.

The system operation looks like this: https://radikal.ru/lfp/a.radikal.ru/a27/2109/3b/10a041f4a62a.jpg/htm

Bat file: http://forum.ru-board.com/topic.cgi?forum=5&topic=0602&start=1166&limit=1&m=1#1

K1.txt is the pattern file made of pieces, and 22.txt is the checked file made of chunks.

Wishes.

For more information, it is better to contact me at zh_76@internet.ru. I am also ready to test ugrep.exe free of charge - I really liked it.

P.S. I don't know how to contact the developer of ugrep v3.3.7, Dr. Robert van Engelen, directly.

zh76internetru commented 2 years ago

I found a solution using https://github.com/BurntSushi/ripgrep/releases/tag/13.0.0. Everything is several times faster and it doesn't eat RAM.

genivia-inc commented 2 years ago

In terms of size, the 2 files turned out to be 2.5 MB each. But while searching, they use 4 GB of RAM.

This is impossible. Files are not stored in RAM when searched. A sliding window over the file is used instead to keep memory usage low. So actual physical memory usage should not be high. Users reported that RAM usage is acceptable.

Users also independently reported that ugrep is faster than ripgrep for common searches, see e.g. https://github.com/Genivia/RE-flex/issues/91 "ugrep managed around 1850MB/s on my disk ... Ripgrep can get to 1700MB/s on my disk." Speed is related to memory usage, so I think ugrep is doing OK. But if there is a problem then I will fix it.

zh76internetru commented 2 years ago

But I really do have one process using all the memory - I attached a screenshot at the link: https://radikal.ru/lfp/a.radikal.ru/a27/2109/3b/10a041f4a62a.jpg/htm. And this is how Ripgrep works: 5 processes are running at once, and even at 100% CPU load each of the 5 processes still searches about 2 times faster. And you can see for yourself what the RAM consumption is. If only all this would also run on CUDA ))) Screenshot_1

genivia-inc commented 2 years ago

What is this 1K12.txt file?

If this is a very large file of patterns to match, then it can certainly create a huge DFA for POSIX matching.

Use ugrep option -P for Perl matching. Ripgrep also internally uses Perl matching to avoid large DFAs.

The traditional grep tool uses POSIX matching, not Perl matching. Grep and ugrep construct regex DFAs internally, and these can grow very large depending on the regex patterns.
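For example (a sketch with placeholder file names, not the exact files from this issue; patterns.txt holds one pattern per line):

ugrep -P -f patterns.txt file-to-search.txt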

zh76internetru commented 2 years ago

1K1.txt ... 1K17.txt are 17 parts of 65,000 lines (2.5 MB) each, created from one large file 1K.txt (about 42 MB). 221.txt ... 21100.txt are 100 parts of 65,000 lines (2.5 MB) each, created from one large file 22.txt (about 250 MB). Otherwise the whole files do not fit into RAM. Ripgrep also does not like such large files; they do not fit into RAM either. So the same 2.5 MB files were used, but the memory consumption is significantly different (as can be seen in the screenshots). Because of this, I can run multiple parallel processes instead of one.

zh76internetru commented 2 years ago

What am I doing wrong? ugrep -nf 1K1.txt 221.txt >> 0.txt - how/where do I put -P?

Screenshot_2 Screenshot_3

genivia-inc commented 2 years ago

So I was right. This memory usage has to do with the fact that the pattern file is quite long and perhaps also very complex at the same time. The DFA construction for such patterns may take quite a bit of time and space, which is theoretically known (it's not an implementation limitation). This is especially true when the patterns are Unicode.

To turn Unicode matching off, use ugrep option -U (no need to use -P). Use this option -U when patterns and files are just plain ASCII or bytes to match, like standard grep also expects them to be.

For your case with option -P: it looks like PCRE (enabled with option -P) complains about the regex pattern being over 2MB long. This is a PCRE error message: "regular expression is too large"

Please make sure to use the latest version of ugrep, because some recent changes were made to improve option -P by avoiding unnecessary memory use (no longer converts the patterns for Unicode matching).

Also, if a pattern file contains strings, not patterns, then use option -F for string matching. In this case each line in the pattern file is considered a string pattern.
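Applied to the command from your earlier comment, that would look something like the following (a sketch using the file names from this thread; -U disables Unicode matching and -F treats each line of the pattern file as a literal string):

ugrep -U -F -n -f 1K1.txt 221.txt >> 0.txt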

zh76internetru commented 2 years ago

The memory consumption did not decrease - it remained the same as it was.

Screenshot_5

I'm staying with Ripgrep

genivia-inc commented 2 years ago

Hm, this is interesting. Sure, it makes sense that memory usage is still huge, because you show this for option -U, which may or may not help - it still produces a large DFA. I did not suggest that -U is the solution; we know that theoretically this can blow up in size, and Unicode just makes it worse. Again, ugrep emulates grep POSIX matching with a DFA if option -P is not used. I'd rather not make -P the default, to keep ugrep compatible with grep as much as possible. Ripgrep and many other grep-like tools are NOT compatible with grep and cannot be used as a drop-in replacement.

The reason this issue was closed is that I was confident that option -P should work just fine to match input with PCRE2. Note that ripgrep and other grep-like tools use PCRE or a similar regex engine.

Perhaps there could be a PCRE2 configuration limit that limits the size of the pattern. However, re-reading the PCRE2 docs it is not obvious because for 32 bit machines the "internal linkage size" of 4 is used, which is "essentially unlimited" see https://www.pcre.org/current/doc/html/pcre2limits.html

Why PCRE2 in your case complains about regex pattern size being too long is not yet clear to me. This is what PCRE2 states:

Perhaps the pcre2_compile() call could be limiting, as the first point states. However, the ugrep code uses static_cast<PCRE2_SIZE>(pat_->size()), so on 32-bit systems the size can be up to 4GB, which allows huge patterns. Otherwise, perhaps pcre2_jit_compile() is limiting.

I will have to look into that and report back later.

zh76internetru commented 2 years ago

Screenshot_6 I have x64

genivia-inc commented 2 years ago

Thanks for the info.

So the problem is caused by PCRE2. PCRE2 is installed with a default configuration that limits the internal code size, which in turn limits the length and complexity of a regex pattern. The PCRE2 API documentation doesn't state this, but the installation and configuration instructions mention that the link size for 8-bit patterns is only 2 bytes. We need at least 3 bytes. That is the same link size as ugrep's internal DFA for POSIX matching, which has practically no limit on pattern size, but may result in a large DFA that takes time to construct, as is very noticeable in your case (a theoretically known fact).

This means that this problem with option -P for PCRE2 will be relatively easy to fix. However, ugrep can no longer rely on an existing libpcre2 installation of PCRE2 on a system, because that may be limiting. Instead, we need to install libpcre2 together with ugrep and configure it properly. For Windows this is not an issue, because I've built ugrep.exe from scratch with a local libpcre2 installation. I will need to rebuild ugrep.exe without this limit.
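For reference, the link size is a PCRE2 build-time setting. With an autotools build of PCRE2 it is raised roughly like this (a sketch of the general idea; the exact steps for the libpcre2 bundled with ugrep may differ):

./configure --with-link-size=3
make && make install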

genivia-inc commented 2 years ago

OK, I rebuilt the ugrep.exe x64 executable with a reconfigured PCRE2 library. I tested this, but could not test with patterns as large as yours.

Attached is the x64 version 3.3.7 of ugrep.exe with the PCRE2 update (zipped). This is a clean file, no malicious stuff I assure you!! (Trust is hard to find these days.)

ugrep.exe.zip

Let me know what happens with ugrep option -P for Perl matching with PCRE. If this is still not enough, then there is another step I can take to increase the pattern sizes even more, but hopefully this gets you moving forward as a first step.

Thanks for reporting this, because it helps to improve ugrep. The PCRE2 API documentation does not mention this limitation (it is more or less "hidden" in the installation instructions). So I was under the impression there was no pattern size limitation.

Perhaps the title of this issue should be "ugrep option -P does not accept very long patterns".

zh76internetru commented 2 years ago

Thanks. I tried it. -P now freezes altogether. I attached a small set of pattern + checked files in an archive, plus 2 batch files for the fresh Grep and for Ripgrep, for comparison. https://transfiles.ru/8basi Screenshot_1

genivia-inc commented 2 years ago

Thanks for the files. I will take a closer look. I'm surprised to see this. Probably overlooked something in the settings or some other detail in the internals.

Robert-van-Engelen commented 2 years ago

There is indeed a memory use issue with the ugrep algorithm to match strings (the files contain strings, not regex patterns). This algorithm is optimized to recognize string matches and construct a pattern matcher more quickly, but the problem with your string patterns is that the algorithm is too greedy and allocates too much memory. I will address this with an update.

Option -P is a different story. I'm surprised that PCRE2 doesn't appear to work well with this set of strings (as patterns).

genivia-inc commented 2 years ago

Now that the list of new ugrep features requested by users is complete and added to the latest ugrep release (ugrep v3.6.0), I am returning to work on this excessive RAM usage issue.

Improving ugrep to handle long -F string patterns is important. Right now the engine uses too much memory to analyze these patterns. I already know exactly why this is the case, and I am confident that this issue can be fixed.

genivia-inc commented 2 years ago

A quick update on my progress over the last two days.

The new version of ugrep I worked on is a lot faster (2.68 sec) and uses 946MB RAM instead of 4.4GB:

/usr/bin/time -l ~/Projects/ugrep/src/ugrep -cf 1K1.txt 2.txt --stats=vm
0
Searched 1 file: 0 matching
The following pathname selections and restrictions were applied:
  --no-hidden (default)
VM memory: 1891639 nodes (356ms), 1891638 edges (1327ms), 5736220 opcode words (475ms)
        2.68 real         2.26 user         0.41 sys
 946262016  maximum resident set size
         0  average shared memory size
         0  average unshared data size
         0  average unshared stack size
    265091  page reclaims
         0  page faults
         0  swaps
         0  block input operations
         0  block output operations
         0  messages sent
         0  messages received
         0  signals received
         0  voluntary context switches
       307  involuntary context switches

I still have some work to do to try to optimize this further, if possible.

The old version that uses 4.4GB and runs slower (4.89 sec):

 /usr/bin/time -l ugrep -cf 1K1.txt 2.txt --stats=vm
0
Searched 1 file: 0 matching
The following pathname selections and restrictions were applied:
  --no-hidden (default)
VM memory: 1891639 nodes (499ms), 1891638 edges (3376ms), 5736220 opcode words (496ms)
        4.89 real         3.41 user         1.46 sys
4438999040  maximum resident set size
         0  average shared memory size
         0  average unshared data size
         0  average unshared stack size
   1121547  page reclaims
         0  page faults
         0  swaps
         0  block input operations
         0  block output operations
         0  messages sent
         0  messages received
         0  signals received
         0  voluntary context switches
      1011  involuntary context switches

Edit: the -F string matching is further optimized to run in 2.13s with 617MB, which looks more promising:

/usr/bin/time -l ~/Projects/ugrep/src/ugrep -cf 1K1.txt 2.txt --stats=vm
0
Searched 1 file: 0 matching
The following pathname selections and restrictions were applied:
  --no-hidden (default)
VM memory: 1891639 nodes (293ms), 1891638 edges (870ms), 5736220 opcode words (440ms)
        2.13 real         1.86 user         0.26 sys
 617988096  maximum resident set size
         0  average shared memory size
         0  average unshared data size
         0  average unshared stack size
    151206  page reclaims
         0  page faults
         0  swaps
         0  block input operations
         0  block output operations
         0  messages sent
         0  messages received
         0  signals received
         0  voluntary context switches
       298  involuntary context switches

genivia-inc commented 2 years ago

Further improved pattern parsing speed, now down to 1.40s with 592MB:

/usr/bin/time -l ~/Projects/ugrep/src/ugrep -cf 1K1.txt 2.txt --stats=vm
0
Searched 1 file: 0 matching
The following pathname selections and restrictions were applied:
  --no-hidden (default)
VM: 2142062 parse (300ms), 1891638 nodes (0ms), 1891638 edges (300ms), 5724106 opcode words (599ms)
        1.40 real         1.20 user         0.19 sys
 592113664  maximum resident set size
         0  average shared memory size
         0  average unshared data size
         0  average unshared stack size
    146943  page reclaims
         0  page faults
         0  swaps
         0  block input operations
         0  block output operations
         0  messages sent
         0  messages received
         0  signals received
         0  voluntary context switches
       164  involuntary context switches

Most of the time is spent on pattern parsing and DFA opcode generation for the pattern matcher engine. This can perhaps be optimized further.

genivia-inc commented 2 years ago

Your 2.txt file is small (2MB), with pattern files that are 2MB each. This small file to search gives ugrep a slight disadvantage, because ugrep runs faster on larger files with the new hashing algorithm I invented and implemented. The latest ugrep improvements now clearly show that ugrep runs faster than, say, GNU grep on the 100MB enwik8 file.

GNU grep takes 2.81 seconds:

/usr/bin/time ggrep -cf 1K1.txt enwik8
0
        2.81 real         2.71 user         0.06 sys

Ugrep for the same search takes 1.56 seconds:

/usr/bin/time ~/Projects/ugrep/src/ugrep -cf 1K1.txt enwik8
0
        1.56 real         1.33 user         0.21 sys

As you point out, the older ugrep version suffers from RAM usage slowdown, which will be fixed in the upcoming v3.7 release.

genivia-inc commented 2 years ago

Ugrep v3.7.0 is released with lower RAM usage and significant speed improvements for long and complex patterns. See also issue #188.

genivia-inc commented 2 years ago

Let me also suggest a more effective way to speed up your searches. Instead of using one large pattern file, which generally consumes a lot of memory with any grep tool, you can split it up into several pattern files and then run the searches in parallel. This works on the Windows cmd prompt as follows:

start /B ugrep -cf 1K1.txt 2.txt > 1K1.out & start /B ugrep -cf 1K2.txt 2.txt > 1K2.out

then concatenate the results if desired (after the ugrep searches are finished):

type 1K1.out 1K2.out > 1K12.out

The same, but with four parallel searches:

start /B ugrep -cf 1K1.txt 2.txt > 1K1.out & start /B ugrep -cf 1K2.txt 2.txt > 1K2.out & start /B ugrep -cf 1K3.txt 2.txt > 1K3.out & start /B ugrep -cf 1K4.txt 2.txt > 1K4.out

This runs faster on multicore machines, but also because the complexity of the patterns is lower for each search, which generally requires less memory with ugrep and other grep tools. Lower memory use benefits performance by avoiding detrimental memory hierarchy effects (cache misses).

garry-ut99 commented 1 month ago

This stupid issue is still here: ug -vxFf A.txt B.txt > output.txt quickly begins eating RAM up to an incredible amount: 13GB out of the 16GB available in a matter of 1 minute, then Windows displays a low memory warning, and then the program is terminated, leaving a 0-byte garbage output file. Grep 3.11 ate 9 GB and didn't crash, but even 9GB is a bad joke.

I just tried to do a popular operation: get rid of the duplicate lines in B compared to A. File A.txt is 278 MB (short text lines), file B.txt is 661 MB (short text lines). The output.txt file is estimated to be somewhere around 383MB or more. All 3 files take ~1.33 GB. Which means the program tries to eat an additional 11.67 GB or even more - almost 10x more than the size of all 3 files combined. I'm curious how much more it would take if I had 32 GB of RAM: would it take 16GB or 20GB?! Or 30GB and then crash?! I do understand that the program takes additional RAM to increase speed, but for fu..ing hell's sake it should not eat all the available RAM and crash. This is a bad design flaw or bug; there should be some balance point between speed and RAM usage, or some option to limit RAM usage.

Let me also suggest a more effective way to speed up your searches. (...)

No thanks, sorry, but this "workaround" proposition scares me. I have many large files on my disk. I'm not interested in screwing around with my files, spending additional time, wasting effort, and creating strangely long command lines every time I want to do a simple operation like removing duplicate lines from a fu..king file, especially when I have 10x more available RAM than the 3 files take. The burden of splitting files into smaller chunks should be shifted to the program, and the splitting should be handled automatically by the program as part of the algorithm. It's a bad design flaw, or a too RAM-hungry algorithm, or a bug (whatever it's considered to be); there are too many stupid programs that eat dozens of GB of available RAM and then stupidly crash, even when given not-too-big input files and having 10x more free RAM than the size of the files.

genivia-inc commented 1 month ago

Option -f A.txt consumes a file with a set of patterns to search. Naturally, it takes more memory to store this information than just the size of A.txt alone. How much depends on what the patterns represent and how many patterns we have to store; expect multiple times the size of A.txt. The reason is that the A.txt strings are converted and optimized for searching by constructing a tree, where each node in the tree stores a single letter. The nodes are then converted to a virtual machine. Each node (letter) in the tree takes 16 to 32 bytes (or even more, depending on the tree branches). The resulting VM also takes 16 to 32 bytes per node. That's how regex matching with the VM is so fast in ugrep; see the RE/flex regex scanning speed comparisons (scanning is repeated regex matching, but without searching ahead).

It is not a problem for ugrep to have thousands of patterns in the file for option -f. But if the patterns (or strings) are long and they don't overlap much, then the resulting RAM usage will be much larger than the size of this input file. This RAM usage is hard to predict in advance, but we could simply bail out if the tree grows too large.
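As a rough back-of-the-envelope illustration (an approximation, not a measurement): with domain names that share few common prefixes, nearly every character of a 278 MB pattern file becomes its own tree node, so at 16 to 32 bytes per node the tree alone is on the order of 4.5 to 9 GB, and the generated VM adds a comparable amount on top - in the same ballpark as the 13 GB you observed before the process was terminated.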

When you use option --stats=vm then you'll see the tree usage like the number of nodes, edges and the number of VM words generated. Each VM word is 32 bits.

There are algorithms that are more optimal for searching many strings in the input than the tree and VM used in ugrep, which could be considered as an enhancement for ugrep. On the other hand, this won't work when we need to store a regex pattern file (instead of a set of strings).

garry-ut99 commented 1 month ago

Hello, I also ran a test with rg, which failed too:

Also adding some more information to my previous comment: A.txt and B.txt contain domain names, example (small part):

SPOILER 1 ``` humanbusiness.substack.com humanbusinessworks.com humanbydefault.bandcamp.com humanbydesign-perkinseastman.medium.com humanbyergalis.fr humanbyorientation.com humanbytalent.com humanc.nl humancad.com humancalendar.com humancalibur.click humancannonball.bandcamp.com humancannonball.de humancannonball.us humancanvasproject.com humancap.com humancap.ventures humancapabilityinitiative.org humancapable.com humancapital-hc.com humancapital.aon.com ```

Also such ones:

SPOILER 2 ``` ärzte-für-aufklärung.de çç.com.es étudiant.es ínstaslot.com ônion.com ögr.at önion.com öpnvkarte.de österreich.at ü.news überbrückungshilfe-studierende.de übersetzerdatenbank.net 0------------0-------------0.0n-line.biz 0------------0-------------0.0n-line.info 0------------0-------------0.123erp.net 0----0.0x28.com 0----0.info 0----q.tumblr.com 0---0.info 0---00055.0000hugetits.0-108.btcc.com 0--0----------------------------------------------------------0.ru 0--0---------------------------------------------------axelname.ru 0--0--------dra-dot-ru-premium-domains-and-services-----------0.ru 0--0--0-364f0c42.int.ad.smartnews.net 0--0-0.ru 0--0.0--0.blue-jade.com 0--0.0--0.blue-jade.net 0--0.0--0.elmofo.com 0--0.0--0.g7vn.com 0--0.0--0.hnun.net 0--0.0--0.htr-labs.net 0--0.00-com.info 0--0.info 0--0.moneyextravaganza.com 0--0003.04.openvpn.cloud.api.btcchina.com 0--0003.04.openvpn.cloud.cloud.api.btcchina.com 0--0003.04.openvpn.cloud.cloud.btcchina.com 0--0003.04.openvpn.cloud.openvpn.btcchina.com 0--0004.0000sexygia.0000-110.btcc.com 0--000hugetits.0000hugetits.0-108.btcc.com 0--000sweetkelly.0000mg.0-108.btcc.com 0--4.com 0--51.openvpn.www.eth.bw.com 0--bypass.info 0--d.com 0--e.info 0--efvy88h.bitcoin.com 0--firefox.info 0--foodwarez.da.ru 0--liamariejohnson--0.nedrobin.net 0--loook.net 0--m.homes 0--proxy.info 0--skitch--0.tt.omtrdc.net 0--www.stripchat.com 0--y.com 0-0-----------------------------------------------------------0.com 0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-a.de 0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-a.de 0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0.com 0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-harvestgoodnode.work 0-0-0-0-0-0-0-0-0-0-0-0-0-10-0-0-0-0-0-0-0-0-0-0-0-0-0.info 0-0-0-0-0-0-0-0-0-0-0-0-0-17-0-0-0-0-0-0-0-0-0-0-0-0-0.info 0-0-0-0-0-0-0-0-0-0-0-0-0-18-0-0-0-0-0-0-0-0-0-0-0-0-0.info 0-0-0-0-0-0-0-0-0-0-0-0-0-20-0-0-0-0-0-0-0-0-0-0-0-0-0.info 0-0-0-0-0-0-0-0-0-0-0-0-0-26-0-0-0-0-0-0-0-0-0-0-0-0-0.info 0-0-0-0-0-0-0-0-0-0-0-0-0-27-0-0-0-0-0-0-0-0-0-0-0-0-0.info 0-0-0-0-0-0-0-0-0-0-0-0-0-3-0-0-0-0-0-0-0-0-0-0-0-0-0.info 0-0-0-0-0-0-0-0-0-0-0-0-0-30-0-0-0-0-0-0-0-0-0-0-0-0-0.info 0-0-0-0-0-0-0-0-0-0-0-0-0-33-0-0-0-0-0-0-0-0-0-0-0-0-0.info 0-0-0-0-0-0-0-0-0-0-0-0-0-36-0-0-0-0-0-0-0-0-0-0-0-0-0.info 0-0-0-0-0-0-0-0-0-0-0-0-0-37-0-0-0-0-0-0-0-0-0-0-0-0-0.info 0-0-0-0-0-0-0-0-0-0-0-0-0-38-0-0-0-0-0-0-0-0-0-0-0-0-0.info 0-0-0-0-0-0-0-0-0-0-0-0-0-45-0-0-0-0-0-0-0-0-0-0-0-0-0.info 0-0-0-0-0-0-0-0-0-0-0-0-0-49-0-0-0-0-0-0-0-0-0-0-0-0-0.info 0-0-0-0-0-0-0-0-0-0-0-0-0-5-0-0-0-0-0-0-0-0-0-0-0-0-0.info 0-0-0-0-0-0-0-0-0-0-0-0-0-53-0-0-0-0-0-0-0-0-0-0-0-0-0.info 0-0-0-0-0-0-0-0-0-0-0-0-0-54-0-0-0-0-0-0-0-0-0-0-0-0-0.info 0-0-0-0-0-0-0-0-0-0-0-0-0-59-0-0-0-0-0-0-0-0-0-0-0-0-0.info 0-0-0-0-0-0-0-0-0-0-0-0-0-6-0-0-0-0-0-0-0-0-0-0-0-0-0.info 0-0-0-0-0-0-0-0-0-0-0-0-0-7-0-0-0-0-0-0-0-0-0-0-0-0-0.info 0-0-0-0-0-0-0-0-0-0-0-0-0-8-0-0-0-0-0-0-0-0-0-0-0-0-0.info 0-0-0-0-0-0-0-0-0-0-0-0-0-9-0-0-0-0-0-0-0-0-0-0-0-0-0.info 0-0-0-0-0-0-0-0-0-0.com 0-0-0-0-0-0-0-0-0.20megsfree.com 0-0-0-0-0-0-0.com 0-0-0-0-0-0.com 0-0-0-0-0-0proxy.tserv.se 0-0-0-0-0.cn.com 0-0-0-0-0.com 0-0-0-0-0.de 0-0-0-0-0go.24ha7.com 0-0-0-0-0proxy.emntest.com 0-0-0-0-0proxy.sochy.org 0-0-0-0.2itb.com 0-0-0-0go.00-org.info 0-0-0-0go.sxn.us 0-0-0-0proxy.macrofox.org 0-0-0-0surf.00-net.info 0-0-0-11vernm.com 0-0-0-birkart-5wecanary-banzai-ui-webt-ifftm24b-analytics.com.com.expediagroup.com 0-0-0-birkart-5wecanary-banzai-ui-webt-ifftm24b-analytics.expediagroup.com 
0-0-0-birkart-clws-ifs-globalmn.net.daraz.com 0-0-0.6te.net 0-0-0.com 0-0-0.it 0-0-0.lecktronix.net 0-0-0.macvillage.net 0-0-0.msgserver.net 0-0-0.ru 0-0-0.wakaf.net 0-0-0.xaper.com 0-0-0.yearbookhigh.com 0-0-04atakoyxrst2102esenkentt31.facebookbrand-2018-dev.fb.com 0-0-09zh.facebookbrand-2018-dev.fb.com 0-0-0checkmate.com 0-0-0proxy.emntest.com 0-0-0proxy.info 0-0-0yd12pantivirus-extension.facebookbrand-2018-dev.fb.com 0-0-1.ru 0-0-1.xyz 0-0-1009-37.1.nflxso.net 0-0-11-0-0.com 0-0-1722-2773.1.nflxso.net 0-0-24.ru 0-0-3.ru 0-0-3c76b6f1aebc03c084elopers.buildyourdata.com 0-0-7-analytics1942arketage078nautil011659tindocsg-devops.api-chinaweb01ocs-analytics622mctlbrand-okta-idp.performancehorizon.com 0-0-7-analytics204078nautil02307enticationdoc-code10600ndocinx.web01ocs-00000-okta-idpapi-china.performancehorizon.com 0-0-7-analytics204078nautil0authenticationdoc-code10600ndocinx.okta-china.performancehorizon.com 0-0-7-analyticsapexit078nautil0v1enticationg-code10600nginx.okta-api-china.performancehorizon.com 0-0-7-analyticsdocs078nautil0uploregiong-code10600nginx.okta-api-china.performancehorizon.com 0-0-7-analyticsdocs078nautil0uploregiong-marketingnginx.api-web01ocs-00000-okta-idp.performancehorizon.com 0-0-7-analyticsdocs2883nautil0uploadg-codeweb01ocsnginx.okta-china.performancehorizon.com 0-0-7-analyticsdocsctlnautil0uploadg-marketageweb01ocs.api-okta.performancehorizon.com 0-0-7-analyticsdocsctlnautil0uploadg-marketageweb01ocs.okta.performancehorizon.com 0-0-7-analyticsdocsctlnautil0uploadg-marketageweb01ocs.web01ocs-analytics622mctlbrand-okta-idp.performancehorizon.com 0-0-7-analyticsdocsctlnautil0uploadg-svn.okta-china.performancehorizon.com 0-0-7-analyticsk8s078nautil0authenticationg-code10600nginx.web01ocs-00000-okta-idp.performancehorizon.com 0-0-7-analyticsmerchant078nautil0uploadg-code1ce01ocsnginx.web01ocs-analytics622mctlbrand-okta-idp.api-china.performancehorizon.com 0-0-7-certdbocscontentcom01ocs-demophgconsole30701ac9a.api-chinaweb01ocs-30northgroup-okta-idp.performancehorizon.com 0-0-7-st611ds-codeweb01ocsn611dsinx.15-okta-api-china.performancehorizon.com 0-0-7-stosexternalj-codeweb01ocsnosexternaljinx.okta-china.performancehorizon.com 0-0-7-webdbocsweb01ocs-analytics622mctlbrand-okta-idp01ocs-576.api-china.performancehorizon.com 0-0-7-webdbocswwwv3dochomer01ocs-demophgconsoledevac9a.web01ocs-analytics622mctlbrand-okta-wwwcomignite17-api-china.performancehorizon.com 0-0-7.in 0-0-7.ru 0-0-8-8-8.icu 0-0-8-8.icu 0-0-9-8.icu 0-0-aaatent.xyz 0-0-ads-transparency-071ee8838b58fac5.facebookbrand-2018-dev.fb.com 0-0-ads-transparency-0fs1tf300gayrettepexrst2104bagcilart31.facebookbrand-2018-dev.fb.com 0-0-ads-transparency-5p2qz.facebookbrand-2018-dev.fb.com 0-0-ads-transparency-mrc53.facebookbrand-2018-dev.fb.com 0-0-ads-transparency-nai92.facebookbrand-2018-dev.fb.com 0-0-ads-transparency.facebookbrand-2018-dev.fb.com 0-0-ads-transparencycob87-1-88-179-90-136.facebookbrand-2018-dev.fb.com 0-0-adult-superstore.com 0-0-api-02d20ad797614c3ab9c24e5a9992fd2c.facebookbrand-2018-dev.fb.com 0-0-api-0b60e520yktfblty25ner5lf1i.facebookbrand-2018-dev.fb.com ```

By the way, as can be seen, the rg guy is rude:

genivia-inc commented 1 month ago

The content of A.txt doesn't matter much if parsed as fixed strings with -F. Because these are all domain names that are probably unique with little to no overlap, there is a lot of information to store in the internal tree in ugrep and consequently a very large set of virtual machine words for regex matching.

As I've mentioned, there are specific algorithms for this type of problem of searching for many strings in some text, such as Rabin-Karp and Aho-Corasick. The latter is used by fgrep and also uses a finite state machine (though a different kind than ugrep's), which means that it takes more space in memory than the file with the patterns to search. I don't know if you have tried fgrep for your use case? It might also not work due to these space requirements.

garry-ut99 commented 1 month ago

genivia-inc : I don't know if you have tried fgrep for your use case?

I already did, and it worked, as mentioned at the beginning of my initial comment (I used the same options as with ug and rg: grep -vxFf):

garry-ut99 : https://github.com/Genivia/ugrep/issues/153#issuecomment-2272449057 : Grep 3.11 ate 9 GB and didn't crash

under an assumption that fgrep and grep -F are the same: https://pubs.opengroup.org/onlinepubs/7990989799/xcu/fgrep.html :

fgrep - search a file for a fixed-string pattern. The name fgrep is an obsolescent version equivalent to grep -F.

I even additionally tried the legacy fgrep.exe binary that comes with the older grep.exe 2.5.4 binary (https://gnuwin32.sourceforge.net/packages/grep.htm), with fgrep -vxf A.txt B.txt > output.txt: after a few seconds it failed with fgrep: memory exhausted, with at most 2GB of RAM used.

grep -vxFf worked fine in the current case, but it ate 9GB, which means it might not work in other cases, as it's not far from exhausting all the available RAM - slightly bigger files and it might fail. I mean, grep is still too RAM hungry.

genivia-inc commented 1 month ago

Well, the pattern files are just too big in practice to do this fast enough and without running into memory issues with grep-like tools, which is not that surprising IMO. Even if a grep tool accepts the patterns, it will be very slow to search B.txt because there are a lot of patterns to match.

There is no reason to approach this problem in only one way, especially when we know that it will never be as fast as we want or need it to be. There are many other ways this can be done more efficiently.

Another way is to extract the domain names from a file to search like B.txt then check if each of these domain names is in a database. That is a lot more efficient. Instead of a database, you could search each domain name in the A.txt file for a match, but that would be slower.

To find the domain names in B.txt:

ug -P -o '(https?://|www\.)[-a-zA-Z0-9@:%._+~#=]{1,253}\.[-a-zA-Z0-9]{2,}' B.txt | sort -u | sed -E 's/^https?:\/\///'

This retrieves domain names from B.txt that start with http:// or https:// or www. Store these in a file domains.txt (by redirection > domains.txt). You can then try the following, but it won't be as fast as searching a database!!

ug -m1 -F -f domains.txt A.txt

This will go through all A.txt domains to find the first match (-m1) only (or if you want all matches then don't use -m1). This last step is not efficient compared to searching a database of domain names, because a DB engine indexes primary keys to find matches fast.

garry-ut99 commented 1 month ago

Let's clear one thing up: when I said in my earlier comments that "A.txt and B.txt contain domain names", it was supposed to mean that they contain only domain names and nothing more (since I didn't mention anything else being present in the files, I thought that would be sufficient to imply it). Hence there is no need to extract domains from them. Even more: the domains in both files are already sorted in a-z order and have duplicate lines removed, so the step of creating "domains.txt" can be skipped, as both files already contain only domains.

So I jumped directly to step 2: your command prints the duplicate lines between A and B, instead of the non-duplicate lines from B (the bigger file), but this is not what I need, as B needs to be cleaned of domains that are already present in A. Even if probing a database is faster, I don't see how printing duplicate lines is useful in this case.

Also, what exactly do you mean by "database"? The ugrep indexing feature (LINK) or some other database? I tried https://lucene.apache.org/ in the past, but without success.

genivia-inc commented 1 month ago

A SQL DB or other DB would work.

Well, sorry I didn't get the details of what's in B.txt. I was just commenting on the general case of how to approach the task of searching files for domain names. Also, I get a lot of questions per day and am very busy otherwise, so I'm not rereading every detail of a long thread like this one.

If all you have is two sets of domain names, then your question can be reformulated as finding the intersection of the two sets. The intersection is the set of domain name matches. When the two sets are sorted by some key (e.g. lexicographically), then this is trivial to do in linear time ${\cal O}(|A|+|B|)$ for sets $A$ and $B$ (assuming string comparison is ${\cal O}(1)$, i.e. unit time). Since sorting takes ${\cal O}(n\,\log n)$ time, the whole task takes no more than ${\cal O}(n\,\log n)$ time with $n=\max(|A|, |B|)$.
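A minimal sketch of that linear-time merge using the standard comm tool (not part of ugrep; available on Windows via e.g. Git Bash or Cygwin), assuming both files are already sorted in the byte order comm expects (e.g. LC_ALL=C sort):

comm -23 B.txt A.txt > output.txt

comm walks both sorted files once, so memory use stays small and constant; -23 suppresses the lines unique to A.txt and the lines common to both, leaving only the lines of B.txt that are not in A.txt. Use comm -12 A.txt B.txt to print the intersection instead.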

garry-ut99 commented 1 month ago

to find the intersection of the two sets

Most likely. To be 100% clear and leave no doubts, here is a simplified example (of course the empty lines should be removed in output.txt; I left them in on purpose to demonstrate better, and the duplicated lines are aa, bb and gg):

A.txt   B.txt   output.txt
aa      aa
bb      ba      ba
hh      bb
gg      ca      ca
        cc      cc
        gg
        vv      vv
        xx      xx

And keep in mind that the point is to not eat too much RAM and not crash.

Basically it's a task for "file compare tools", I tried WinMerge - an open source fast file comparer:

I used "full text compare" with some options ticked like "ignore end-lines" etc, but it still behaves exactly like ug and rg, started eating RAM, then after ~1 min it took 13GB of 16GB and out of memory error, then crash-terminated.