VFPX / GoFish

GoFish is an advanced code search tool for fast searching and replacing of Visual FoxPro source code.

Dynamic file and folder list. #256

Open - myearwood1 opened this issue 3 months ago

myearwood1 commented 3 months ago

📝 Provide a description of the new feature

What is the expected behavior of the proposed feature? What is the scenario in which this would be used?

@Jimrnelson

I suspect that even with Grep, GoFish is still using some method to access the directory, such as recursing folders with ADIR or multiple calls to DIR .prg>outputfile, DIR .??a>outputfile.txt

What if there was a table that was updated in the background while a developer works? GoFish could access that table and do a single query like this:

select * from c_temp where inlist(fileext,'PRG','SCA','FRA','VCA','LBA','MNA') into cursor c_temp1 NOFILTER

That takes .4 seconds on a cursor with all 992,000 files on my C drive. I tried my DirX on my dev folder to build a cursor of all files, which took .21 seconds. The query above took .012 seconds.

This could be a separate project. Do you see it saving you time and/or reducing programming? It would need start, stop and refresh functions. It should use buffering to update the table.
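To make the idea concrete, here is a minimal Python sketch (not GoFish code; the sample paths and helper name are invented). Assuming such a background-maintained list already exists, each search pays only for a cheap extension filter, mirroring the INLIST() query above:

```python
import os

# Hypothetical stand-in for the background-maintained table of files.
file_list = [
    r"c:\dev\main.prg",
    r"c:\dev\forms\customer.sca",
    r"c:\dev\reports\invoice.frx",
    r"c:\dev\readme.txt",
]

# VFP source-file extensions, mirroring the INLIST() in the query above.
SOURCE_EXTS = {"PRG", "SCA", "FRA", "VCA", "LBA", "MNA"}

def filter_source_files(files):
    """Keep only entries whose extension marks VFP source code."""
    return [f for f in files
            if os.path.splitext(f)[1].lstrip(".").upper() in SOURCE_EXTS]
```

The design's point is that the expensive directory walk happens once, in the background; each search then pays only for this filter.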


If you'd like to see this feature implemented, add a 👍 reaction to this post.

Jimrnelson commented 3 months ago

@myearwood1

A lot to say here:

I suspect that even with Grep, GoFish is still using some method to access the directory, such as recursing folders with ADIR or multiple calls to DIR .prg>outputfile, DIR .??a>outputfile.txt

This time your suspicion is incorrect. When GF uses grep to search directories, there is nary a call to DIR or ADIR in sight.

grep performs two functions for GF (and does them very fast):

  • traversing the list of files in the directory and sub-directories
  • determining which files have possible matches to the search string.

GF then searches through the list of possible matches, one file at a time ("normal" GF processing).

In my test folder of 8,000 files:

  • GF without grep took 11.9 seconds (less than 5% of that spent generating the list of files)
  • GF with grep took 2.4 seconds

This shows that using grep is about 5 times faster (this ratio gradually decreases with the number of matches).

This demonstrates that the path to optimization is to minimize the number of files for "normal" GF processing. Thus I do not think there is any advantage to maintaining a separate list of files for GF to process.


Which brings up xargs.exe, which you suggested a week or so ago as a path to use when using GF for files in a project. I had high hopes for continued success and spent considerable time implementing it within GF. However, the results were extremely disappointing.

For my test project with ~2,000 files:

  • "Normal" GF took 7.14 seconds
  • Using xargs.exe to call grep.exe took 11.42 seconds (about 60% slower)

I believe that the underlying problem here is that grep is hardwired to rapidly search directory trees but does not have a native way to read a list of files. Using xargs (in "chunks" of files about 23K bytes each) apparently adds so much overhead as to make this approach unusable.
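The batching that xargs performs can be sketched in Python for illustration (the helper and the sample paths are invented; the 23,000-byte budget matches the chunk size described above). Each chunk corresponds to one extra grep.exe process launch, a plausible source of the overhead:

```python
def chunk_args(filenames, max_bytes=23_000):
    """Split file names into chunks whose combined command-line length
    stays under max_bytes, the way xargs batches arguments. Each chunk
    becomes one separate grep invocation."""
    chunks, current, size = [], [], 0
    for name in filenames:
        cost = len(name) + 1  # +1 for the separating space
        if current and size + cost > max_bytes:
            chunks.append(current)
            current, size = [], 0
        current.append(name)
        size += cost
    if current:
        chunks.append(current)
    return chunks

# ~2,000 files with 27-character paths -> several grep launches, not one.
names = [f"c:/project/src/file{i:04}.prg" for i in range(2000)]
batches = chunk_args(names)
```

So instead of a single recursive grep, the project search pays for multiple process launches plus the piping, which is consistent with the 60% slowdown reported above.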

I am very satisfied with the dramatic improvements we have achieved in searching folder trees. So far, no suggestion for searching lists of files has proved profitable.

myearwood1 commented 3 months ago

@Jimrnelson

I'm glad we worked through getting grep incorporated. I understand that grep is getting the files. It's because you were asking about lists of files - hence xargs, which I had no time to test - that I was trying to think up a faster way to scan the set of files. Think back to my original SQL, which you said was a weak example: if we instantiate a regex object before the SQL and then refer to that object in the SQL, there will be little overhead. So I'm guessing a .012-second query over a subset of the files, which asks regex to test each file. That is similar to the LOCATE command that gave you a 7 times boost. As long as there is a cursor of the files to be scanned, that might do it.


Jimrnelson commented 3 months ago

@myearwood1

You said:

if we instantiate a regex object before the sql, then refer to that object in the sql, there will be little overhead. So I'm guessing a .012 query of a subset of the files, which asks regex to test the file. That is similar to the locate command that gave you a 7 times boost. As long as there is a cursor of the files to be scanned, that might do it.

Unfortunately, I have disappointing news on this front. I have tested this technique and found only negligible improvement (about 6%).

What GF does "normally" is to scan a cursor of file names and for each file it performs FileToStr and then RegEx on the contents.

Your suggestion has moved that process so that it occurs within the Select statement, but it still is necessary to perform FileToStr and Regex on each file.

The negligible savings occurs because Select is slightly more effective in this case than the looping in the normal case, but the meat of what is happening, the FileToStr/Regex, must still be performed for each file.

It is now my belief that to achieve any substantial savings we would need to find something completely outside of VFP, some Windows utility (like the Grep.exe you found), that can work on an entire list of files.

myearwood1 commented 3 months ago

Let me see if I follow. You request a utility that can scan a hierarchy of folders to produce a cursor of files, or deal with an existing cursor of files and do a regex on the listed files without filetostr?

Jimrnelson commented 3 months ago

Let me see if I follow. You request a utility that can scan a hierarchy of folders to produce a cursor of files, or deal with an existing cursor of files and do a regex on the listed files without filetostr?

Actually, not quite either.


Conceptually, there are four steps within the GF search engine:

  1. Create a list of files to be searched
  2. For each file, get the text representation of the file
  3. Search the text representation of the file to see if there might be any matches
  4. If there might be any matches, do a more precise search to locate and record those matches.

For "normal" GF searching in a directory / sub-directory, these becomes

  1. ADIR
  2. FileToStr
  3. Grep
  4. Custom Search Engine code
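For illustration only (this is not GF's actual engine), the four steps can be written as one small pipeline; the Python sketch below uses an in-memory dict of name -> text in place of ADIR and FileToStr:

```python
import re

def search(pattern_text, sources):
    """sources: dict of file name -> file text (stand-in for steps 1-2).
    The coarse search plays grep's role (step 3); finditer then records
    the precise match positions (step 4)."""
    compiled = re.compile(pattern_text, re.IGNORECASE)
    results = []
    for name, text in sources.items():
        if not compiled.search(text):        # step 3: any match at all?
            continue
        for m in compiled.finditer(text):    # step 4: locate each match
            results.append((name, m.start(), m.group()))
    return results

# Invented sample "files" for the demonstration.
sources = {
    "main.prg":   "IF there  is a problem\nRETURN\nENDIF",
    "util.prg":   "RETURN .T.",
    "report.prg": "* note: there is nothing here yet",
}
hits = search(r"there\s+is", sources)
```

The coarse step is cheap and discards most files (util.prg here), so only the survivors pay for the precise positional search.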

When searching directories and their sub-directories, the recent optimization of a few weeks ago uses grep.exe to combine the first three steps into one and do so much faster than VFP was able to (as noted, maybe 4-6 times faster).


The unsolved problem is how to optimize this search when step 1 obtains the list of files from a different source than ADIR (from the list of files in a project or a list of projects, e.g.).

So what is desirable is the following: a Windows utility that does a grep-like search on a list of files (presumably a text file, one fully qualified file name per line), with a result of one line per each file with a match.

Note that earlier we tried using xargs to pipe a list of file names to grep, but this turned out to be much slower than normal GF.

myearwood1 commented 3 months ago

What I was originally hoping we could do is include ADirEx from Christian Ehlscheid. I was going to add regex to my DirX, but it would be faster if he added it to his ADirEx. His FLL works inside VFP. His utility gets all 20,150 files in my dev folder into a cursor in .287 seconds. If I tell it to do ".??a;.prg" recursive, it takes .241 seconds. That is running his utility twice. I bet he could do it faster. If he supported building the cursor while recursing the folders and searching, that could be as fast as grep. If he also could scan an existing cursor while searching, that could do what grep and xargs cannot do. If he adds a new feature to scan an existing cursor, then we could produce a set of results and then scan that set for a different expression, and so on. That sounds pretty good to me. I asked him. Let's see what he says.

It would do it without filetostr and somewhat outside of vfp.

Jimrnelson commented 3 months ago

@myearwood1

You've put together a lot of IFs there. I think that this issue should be put on hold until we have a clear statement from him.

It sounds like he would be replicating grep in some way (at least the part of searching regular expressions). Interesting.

Note that this is not on my radar as a high priority project. I am not aware of any interest in the community that searching projects needs to be any faster. (For me, I have no need to improve on the 7 seconds it takes to search all of my projects combined.)

myearwood1 commented 3 months ago

That is the problem with democracy - you get less than optimal. Nobody cared for 12 years - what you called natural. That is not acceptable to me. If he and I can make something you can use, don't look a gift horse in the mouth or offend those pushing the envelope by suggesting it's a waste of time. It's 2 ifs, one to add searching to ADirEx and a new feature to scan a cursor to another cursor. He also provides the source code, so a C++ guy could help as long as it doesn't end up getting called ADirSEX. I'll gladly take any input you have on making it suitable.

Jimrnelson commented 3 months ago

@myearwood1

Please, there's no need for you to be like that, Mike.


Here's the concept for what I need (as if what will be provided will be a VFP procedure). Three parameters:

  1. A list of file names
  2. Search expression to be used to call grep
  3. Name for the list of file names result

For each file in [1], the grep expression in [2] is used to determine if the file is to be included in the result in [3].

myearwood1 commented 3 months ago

I always get attempted shoot downs from everyone. Have you no imagination? You are content with a status quo? I swear Mr. Scott could travel back in time with a replicator and the majority of average intelligence humans would say, Naw, I'm good.

myearwood1 commented 3 months ago

I agree with your concept. As I see it:

  1. We need an ADIR-like function which can scan the folders and grep the files and make a cursor. IMO ADIR was a mistake from the beginning. It should have been DIR.
  2. An SQL-like function that can scan an existing array/cursor and produce a new cursor.

I've already used Christian's ADirEx to do 2 queries and compare them to find differences with a single SQL command. Can't do that with ADIR and arrays. Status quo. Ha.

Jimrnelson commented 3 months ago

@myearwood1

In your post that initiated this issue, you discussed the possibility of maintaining a table of files to be searched, presumably as a way to shorten the time GF needs to search the files. You wrote:

This could be a separate project.

Yes, I fully agree that this should be a separate project. GoFish is an advanced code search tool (as stated on its home page) and thus I do not believe that its scope should include maintaining the list of files to be searched. A separate project would be the correct path, especially considering that such a project might well grow into significant complexity unrelated to searching code.

Do you see it saving you time and/or reducing programming?

No, it would not save any time nor reduce any programming, nor would it require any additional time or programming. There is already an option "Custom UDF" in the Scope dropdown that allows anybody to write their own code to add records to a cursor of all files they would like to have searched. In this case, that would be one line of code to read from the table of files and insert it into the result cursor.

myearwood1 commented 3 months ago

and as always on this planet, limited thinkers do not try to think outside the box. You originally refused to even entertain the idea of using Grep. Before Grep gofish was not "Advanced". What you don't understand is this: I look for any way to make things as fast as possible for me and possibly the mob DESPITE the resistance, rudeness, attacks and even illegal actions of the mob. A file monitoring utility would mean no need to scan the directory itself every time. Fox can roar through an existing cursor of files. Building that list takes more time than updating the cursor.

The benefit of that would be to extract all files with certain names and then have something like ADirEx scan and regex those files, since Grep, in their limited thinking, cannot do what FoxPro and ADirEx could do.

oShell = Createobject ("wscript.shell")
a=seconds()
oShell.Run("cmd /c dir.exe c:\myfolder\*.* /s", 7, .T.)
?seconds()-m.a

a=seconds()
dirx('c_temp','c:\myfolder\*.*','',.T.)
?seconds()-m.a

The wscript.shell takes 2.869 seconds on my computer. The DirX takes 0.249 seconds. If the cursor already existed:

a=seconds()
select * from c_temp where FILEEXT like "%A" OR FILEEXT LIKE 'PRG' into cursor c_temp2
?seconds()-m.a

would take 0.022 seconds resulting in 6995 records.

A new vfp2c32.fll feature to scan and grep those files all in some wonderland of speed between Fox and Windows sounds good to me.

If the scanning of the files and regex - which vfp2c32 source code already mentions - can be done in one massive burst, that seems worthwhile to me and to gofish.

I may make the changes to vfp2c32 myself and offer them to Christian. We have 52,000 pictures. We add to that pile intermittently.

a=seconds()
dirx('c_pictures','c:\cj_appl\appl\pictures\*.*','')
?seconds()-m.a

DirX takes .478 seconds to put those files into a cursor and index them.

a=seconds()
select * from c_pictures where Fname like "R31502%" into cursor c_temp3
?seconds()-m.a

That takes .011 seconds to find 2 files, then regex them with a modified vfp2c32. I'd gladly accept that. So, you can have something that can recursively scan the folders in .4 seconds instead of 2.8 seconds and potentially something that can rip through 50,000 filenames in .010 seconds.

By the way, I believe you mentioned using GoFish5 to search the source of GoFish7. If you build GoFish7.app, and put it in a separate folder, you can use GoFish7 to gofish the Gofish source code.

myearwood1 commented 3 months ago

The monitoring function would be optional.

myearwood1 commented 3 months ago

EDIT: Wait before you try this. I am going to incorporate PCRE so we get a better set of RegEx abilities.

@Jimrnelson

RegExpFileList.zip

I made a tiny c++ program to read filenames from one file, perform a regexp on each, and output matching files to output.txt

I tested it with an input file of all ??a files in my project folder which is 5471 files.

dir *.??a /s /b > input.txt
RegExpFileList input.txt THERE\s+IS\s+ output.txt

output.txt contains 41 files.

This should be useful for scanning the first set of matches to produce a second set of matches.
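The chained-passes idea - running a second expression over the first pass's matches - can be sketched in Python (the file names, contents, and patterns here are invented for illustration):

```python
import re

def regex_filter(files, pattern_text):
    """One pass: keep only files whose text matches pattern_text.
    files: dict of name -> text (stand-in for a cursor of results)."""
    compiled = re.compile(pattern_text, re.IGNORECASE)
    return {name: text for name, text in files.items()
            if compiled.search(text)}

files = {
    "a.vca": "THERE IS a grid here",
    "b.sca": "THERE  IS a form THISFORM.Refresh()",
    "c.mna": "DEFINE PAD filepad OF mymenu",
}
first = regex_filter(files, r"THERE\s+IS\s+")   # first result set
second = regex_filter(first, r"THISFORM\.")     # refine the first set
```

Because each pass's output has the same shape as its input, result sets can be narrowed repeatedly, which is exactly the "scan the first set of matches to produce a second set" workflow.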