DFO-Ocean-Navigator / netcdf-timestamp-mapper

Maps timestamps (and variables) to a corresponding nc file using sqlite3.
https://dfo-ocean-navigator.github.io/netcdf-timestamp-mapper/
GNU General Public License v3.0

Regex not working. #11

Open dwayne-hart opened 4 years ago

dwayne-hart commented 4 years ago
(navigator) buildadm@u1604-on-production-b:~/ocean-nav/db$ ./nc-timestamp-mapper -n RIOPS-FC-2D-LL -i /data/eccc_forecasts_netcdf/riops_forecast/2D/LL -o /home/buildadm/ocean-nav/db -r ^2019.*(00|06|12|18)_00[0-5]_2D_latlon0.1x0.1.nc$ --dry-run -h
---DRY RUN---
List of non-indexed files not found. Continuing with complete indexing operation...
Creating list of all .nc files in "/data/eccc_forecasts_netcdf/riops_forecast/2D/LL/"...
No .nc files found.
Exiting...

I tested the regex by using the https://regexr.com site.

Expression: ^2019.*(00|06|12|18)_00[0-5]_2D_latlon0.1x0.1.nc$

Text
2019112100_000_2D_latlon0.1x0.1.nc
2019112100_001_2D_latlon0.1x0.1.nc
2019112100_002_2D_latlon0.1x0.1.nc
2019112100_003_2D_latlon0.1x0.1.nc
2019112100_004_2D_latlon0.1x0.1.nc
2019112100_005_2D_latlon0.1x0.1.nc
2019112100_006_2D_latlon0.1x0.1.nc
2019112100_007_2D_latlon0.1x0.1.nc
2019112100_008_2D_latlon0.1x0.1.nc
2019112100_009_2D_latlon0.1x0.1.nc
2019112100_010_2D_latlon0.1x0.1.nc
2019112100_011_2D_latlon0.1x0.1.nc
2019112100_012_2D_latlon0.1x0.1.nc
2019112100_013_2D_latlon0.1x0.1.nc
2019112100_014_2D_latlon0.1x0.1.nc
2019112100_015_2D_latlon0.1x0.1.nc
2019112100_016_2D_latlon0.1x0.1.nc
2019112100_017_2D_latlon0.1x0.1.nc
2019112100_018_2D_latlon0.1x0.1.nc
2019112100_019_2D_latlon0.1x0.1.nc
2019112100_020_2D_latlon0.1x0.1.nc
2019112100_021_2D_latlon0.1x0.1.nc
2019112100_022_2D_latlon0.1x0.1.nc
2019112100_023_2D_latlon0.1x0.1.nc
2019112100_024_2D_latlon0.1x0.1.nc
2019112100_025_2D_latlon0.1x0.1.nc
2019112100_026_2D_latlon0.1x0.1.nc
2019112100_027_2D_latlon0.1x0.1.nc
2019112100_028_2D_latlon0.1x0.1.nc
2019112100_029_2D_latlon0.1x0.1.nc
2019112100_030_2D_latlon0.1x0.1.nc
2019112100_031_2D_latlon0.1x0.1.nc
2019112100_032_2D_latlon0.1x0.1.nc
2019112100_033_2D_latlon0.1x0.1.nc
2019112100_034_2D_latlon0.1x0.1.nc
2019112100_035_2D_latlon0.1x0.1.nc
2019112100_036_2D_latlon0.1x0.1.nc
2019112100_037_2D_latlon0.1x0.1.nc
2019112100_038_2D_latlon0.1x0.1.nc
2019112100_039_2D_latlon0.1x0.1.nc
2019112100_040_2D_latlon0.1x0.1.nc
2019112100_041_2D_latlon0.1x0.1.nc
2019112100_042_2D_latlon0.1x0.1.nc
2019112100_043_2D_latlon0.1x0.1.nc
2019112100_044_2D_latlon0.1x0.1.nc
2019112100_045_2D_latlon0.1x0.1.nc
2019112100_046_2D_latlon0.1x0.1.nc
2019112100_047_2D_latlon0.1x0.1.nc
2019112100_048_2D_latlon0.1x0.1.nc
2019112106_000_2D_latlon0.1x0.1.nc
2019112106_001_2D_latlon0.1x0.1.nc
2019112106_002_2D_latlon0.1x0.1.nc
2019112106_003_2D_latlon0.1x0.1.nc
2019112106_004_2D_latlon0.1x0.1.nc
2019112106_005_2D_latlon0.1x0.1.nc
2019112106_006_2D_latlon0.1x0.1.nc
2019112106_007_2D_latlon0.1x0.1.nc
2019112106_008_2D_latlon0.1x0.1.nc
2019112106_009_2D_latlon0.1x0.1.nc
2019112106_010_2D_latlon0.1x0.1.nc
2019112106_011_2D_latlon0.1x0.1.nc
2019112106_012_2D_latlon0.1x0.1.nc
2019112106_013_2D_latlon0.1x0.1.nc
2019112106_014_2D_latlon0.1x0.1.nc
2019112106_015_2D_latlon0.1x0.1.nc
2019112106_016_2D_latlon0.1x0.1.nc
2019112106_017_2D_latlon0.1x0.1.nc
2019112106_018_2D_latlon0.1x0.1.nc
2019112106_019_2D_latlon0.1x0.1.nc
2019112106_020_2D_latlon0.1x0.1.nc
2019112106_021_2D_latlon0.1x0.1.nc
2019112106_022_2D_latlon0.1x0.1.nc
2019112106_023_2D_latlon0.1x0.1.nc
2019112106_024_2D_latlon0.1x0.1.nc
2019112106_025_2D_latlon0.1x0.1.nc
2019112106_026_2D_latlon0.1x0.1.nc
2019112106_027_2D_latlon0.1x0.1.nc
2019112106_028_2D_latlon0.1x0.1.nc
2019112106_029_2D_latlon0.1x0.1.nc
2019112106_030_2D_latlon0.1x0.1.nc
2019112106_031_2D_latlon0.1x0.1.nc
2019112106_032_2D_latlon0.1x0.1.nc
2019112106_033_2D_latlon0.1x0.1.nc
2019112106_034_2D_latlon0.1x0.1.nc
2019112106_035_2D_latlon0.1x0.1.nc
2019112106_036_2D_latlon0.1x0.1.nc
2019112106_037_2D_latlon0.1x0.1.nc
2019112106_038_2D_latlon0.1x0.1.nc
2019112106_039_2D_latlon0.1x0.1.nc
2019112106_040_2D_latlon0.1x0.1.nc
2019112106_041_2D_latlon0.1x0.1.nc
2019112106_042_2D_latlon0.1x0.1.nc
2019112106_043_2D_latlon0.1x0.1.nc
2019112106_044_2D_latlon0.1x0.1.nc
2019112106_045_2D_latlon0.1x0.1.nc
2019112106_046_2D_latlon0.1x0.1.nc
2019112106_047_2D_latlon0.1x0.1.nc
2019112106_048_2D_latlon0.1x0.1.nc
2019112112_000_2D_latlon0.1x0.1.nc
2019112112_001_2D_latlon0.1x0.1.nc
2019112112_002_2D_latlon0.1x0.1.nc
2019112112_003_2D_latlon0.1x0.1.nc
2019112112_004_2D_latlon0.1x0.1.nc
2019112112_005_2D_latlon0.1x0.1.nc
2019112112_006_2D_latlon0.1x0.1.nc
2019112112_007_2D_latlon0.1x0.1.nc
2019112112_008_2D_latlon0.1x0.1.nc
2019112112_009_2D_latlon0.1x0.1.nc
2019112112_010_2D_latlon0.1x0.1.nc
2019112112_011_2D_latlon0.1x0.1.nc
2019112112_012_2D_latlon0.1x0.1.nc
2019112112_013_2D_latlon0.1x0.1.nc
2019112112_014_2D_latlon0.1x0.1.nc
2019112112_015_2D_latlon0.1x0.1.nc
2019112112_016_2D_latlon0.1x0.1.nc
2019112112_017_2D_latlon0.1x0.1.nc
2019112112_018_2D_latlon0.1x0.1.nc
2019112112_019_2D_latlon0.1x0.1.nc
2019112112_020_2D_latlon0.1x0.1.nc
2019112112_021_2D_latlon0.1x0.1.nc
2019112112_022_2D_latlon0.1x0.1.nc
2019112112_023_2D_latlon0.1x0.1.nc
2019112112_024_2D_latlon0.1x0.1.nc
2019112112_025_2D_latlon0.1x0.1.nc
2019112112_026_2D_latlon0.1x0.1.nc
2019112112_027_2D_latlon0.1x0.1.nc
2019112112_028_2D_latlon0.1x0.1.nc
2019112112_029_2D_latlon0.1x0.1.nc
2019112112_030_2D_latlon0.1x0.1.nc
2019112112_031_2D_latlon0.1x0.1.nc
2019112112_032_2D_latlon0.1x0.1.nc
2019112112_033_2D_latlon0.1x0.1.nc
2019112112_034_2D_latlon0.1x0.1.nc
2019112112_035_2D_latlon0.1x0.1.nc
2019112112_036_2D_latlon0.1x0.1.nc
2019112112_037_2D_latlon0.1x0.1.nc
2019112112_038_2D_latlon0.1x0.1.nc
2019112112_039_2D_latlon0.1x0.1.nc
2019112112_040_2D_latlon0.1x0.1.nc
2019112112_041_2D_latlon0.1x0.1.nc
2019112112_042_2D_latlon0.1x0.1.nc
2019112112_043_2D_latlon0.1x0.1.nc
2019112112_044_2D_latlon0.1x0.1.nc
2019112112_045_2D_latlon0.1x0.1.nc
2019112112_046_2D_latlon0.1x0.1.nc
2019112112_047_2D_latlon0.1x0.1.nc
2019112112_048_2D_latlon0.1x0.1.nc

The regexr site said that it had found 18 entries for this sample data set.

htmlboss commented 4 years ago

Aside from the question, I would suggest adding a backslash to escape the final period in the expression to correctly capture the file extension:

^2019.*(00|06|12|18)_00[0-5]_2D_latlon0.1x0.1\.nc$

This is because . matches any character except line breaks. It makes no difference to the resulting matches here, but it is a little more explicit.
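
To illustrate the point about escaping, here is a minimal sketch using std::regex with the ECMAScript grammar. The helper `matches` and the constants `loose` and `strict` are names made up for this example, and the second test string (with an `X` in place of the final period) is an invented file name, not one from the data set.

```cpp
#include <regex>
#include <string>

// Returns true when the whole of `name` matches `pattern` (ECMAScript syntax).
// Hypothetical helper, not part of nc-timestamp-mapper.
bool matches(const std::string& name, const std::string& pattern) {
    return std::regex_match(name, std::regex(pattern, std::regex::ECMAScript));
}

// Unescaped dots: each '.' matches any single character.
const std::string loose  = R"(^2019.*(00|06|12|18)_00[0-5]_2D_latlon0.1x0.1.nc$)";

// Escaped dots: each '\.' matches only a literal period.
const std::string strict = R"(^2019.*(00|06|12|18)_00[0-5]_2D_latlon0\.1x0\.1\.nc$)";
```

Both patterns accept `2019112100_000_2D_latlon0.1x0.1.nc`. Only `loose` also accepts `2019112100_000_2D_latlon0.1x0.1Xnc`, because its last `.` is a wildcard rather than a literal period.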

dwayne-hart commented 4 years ago

I had tried that last night as well. The application did not see anything.

(navigator) buildadm@u1604-on-production-b:~/ocean-nav/db$ ./nc-timestamp-mapper -n RIOPS-FC-2D-LL -i /data/eccc_forecasts_netcdf/riops_forecast/2D/LL -o /home/buildadm/ocean-nav/db -r ^2019112100_000_2D_latlon0.1x0.1\.nc --dry-run -h
---DRY RUN---
List of non-indexed files not found. Continuing with complete indexing operation...
Creating list of all .nc files in "/data/eccc_forecasts_netcdf/riops_forecast/2D/LL/"...
No .nc files found.
Exiting...

htmlboss commented 4 years ago

I think it's an issue with the expression itself. What are the criteria it's supposed to match?

htmlboss commented 4 years ago

Try this:

^(2019).*(00|06|12|18)_00[0-5].*$

dwayne-hart commented 4 years ago

We wish to have all of 2019 for runs 00, 06, 12, and 18, with times at 000, 001, 002, 003, 004, and 005.
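
The criteria above could be written as a stricter pattern, sketched below. It assumes the file names follow a YYYYMMDDHH_FFF layout (run hour in the last two digits before the underscore, forecast time after it), which is inferred from the sample names in this thread rather than stated anywhere. `isWanted` is a hypothetical helper, not a function in nc-timestamp-mapper.

```cpp
#include <regex>
#include <string>

// True when `name` is a 2019 file from run 00, 06, 12 or 18
// with a forecast time between 000 and 005.
// Assumes a YYYYMMDDHH_FFF_2D_latlon0.1x0.1.nc naming scheme.
bool isWanted(const std::string& name) {
    static const std::regex r(
        R"(^2019\d{4}(00|06|12|18)_00[0-5]_2D_latlon0\.1x0\.1\.nc$)",
        std::regex::ECMAScript);
    return std::regex_match(name, r);
}
```

Using `\d{4}` for the month and day, instead of `.*`, keeps the run hour pinned to a fixed position in the name.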

(navigator) buildadm@u1604-on-production-b:~/ocean-nav/db$ ls /data/eccc_forecasts_netcdf/riops_forecast/2D/LL/ | grep -P "^2019.*(00|06|12|18)_00[0-5]_2D_latlon0.1x0.1.nc$" | tail -10
2019112200_002_2D_latlon0.1x0.1.nc
2019112200_003_2D_latlon0.1x0.1.nc
2019112200_004_2D_latlon0.1x0.1.nc
2019112200_005_2D_latlon0.1x0.1.nc
2019112206_000_2D_latlon0.1x0.1.nc
2019112206_001_2D_latlon0.1x0.1.nc
2019112206_002_2D_latlon0.1x0.1.nc
2019112206_003_2D_latlon0.1x0.1.nc
2019112206_004_2D_latlon0.1x0.1.nc
2019112206_005_2D_latlon0.1x0.1.nc

htmlboss commented 4 years ago

Here's a test binary with the regex parser set to use the ECMAScript grammar and to optimize the given expression. See if this changes anything.

test-build.tar.gz

dwayne-hart commented 4 years ago

The regular expression you wished to try did not work...

(navigator) buildadm@u1604-on-production-b:~/ocean-nav/db$ ./nc-timestamp-mapper -n RIOPS-FC-2D-LL -i /data/eccc_forecasts_netcdf/riops_forecast/2D/LL/ -o /home/buildadm/ocean-nav/db -r ^(2019).*(00|06|12|18)_00[0-5].*$ --dry-run -h

dwayne-hart commented 4 years ago

No joy with the new binary.

(navigator) buildadm@u1604-on-production-b:~/ocean-nav/db$ ./nc-timestamp-mapper -n RIOPS-FC-2D-LL -i /data/eccc_forecasts_netcdf/riops_forecast/2D/LL/ -o /home/buildadm/ocean-nav/db -r ^(2019).*(00|06|12|18)_00[0-5].*$  --dry-run -h
-su: syntax error near unexpected token `('
(navigator) buildadm@u1604-on-production-b:~/ocean-nav/db$ ./nc-timestamp-mapper -n RIOPS-FC-2D-LL -i /data/eccc_forecasts_netcdf/riops_forecast/2D/LL -o /home/buildadm/ocean-nav/db -r ^2019112100_000_2D_latlon0.1x0.1\.nc --dry-run -h
---DRY RUN---
List of non-indexed files not found. Continuing with complete indexing operation...
Creating list of all .nc files in "/data/eccc_forecasts_netcdf/riops_forecast/2D/LL/"...
No .nc files found.
Exiting...
(navigator) buildadm@u1604-on-production-b:~/ocean-nav/db$ ./nc-timestamp-mapper -n RIOPS-FC-2D-LL -i /data/eccc_forecasts_netcdf/riops_forecast/2D/LL -o /home/buildadm/ocean-nav/db -r ^2019112100_000_2D_latlon0.1x0.1.nc --dry-run -h
---DRY RUN---
List of non-indexed files not found. Continuing with complete indexing operation...
Creating list of all .nc files in "/data/eccc_forecasts_netcdf/riops_forecast/2D/LL/"...
No .nc files found.
Exiting...
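
The `syntax error near unexpected token` message in the session above is printed by the shell, not by the mapper: an unquoted `(` is shell syntax, and an unquoted `\.` has its backslash removed before the program sees it. A small sketch for checking which string actually arrives is shown here. `receivedArgs` is a hypothetical helper and is not part of nc-timestamp-mapper.

```cpp
#include <string>
#include <vector>

// Collects the arguments exactly as the shell handed them to the program,
// skipping argv[0] (the program name). Hypothetical debugging helper.
std::vector<std::string> receivedArgs(int argc, const char* const* argv) {
    std::vector<std::string> args;
    for (int i = 1; i < argc; ++i) {
        args.emplace_back(argv[i]);
    }
    return args;
}
```

Printing these values from `main` would show whether the expression after `-r` still contains its backslashes and parentheses. Wrapping the expression in single quotes on the command line passes it through unchanged.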

htmlboss commented 4 years ago

Well, something strange is going on, since I just wrote a test program and my expression works fine:

(navigator) nabil@nabil-Sky-X4:~/regex-test$ g++-9 -std=c++2a -Wall -Wextra -march=native -O3 -pedantic -Wshadow -o regex-tester main.cpp
(navigator) nabil@nabil-Sky-X4:~/regex-test$ ./regex-tester 
Matched: 2019112100_000_2D_latlon0.1x0.1.nc
Matched: 2019112100_001_2D_latlon0.1x0.1.nc
Matched: 2019112100_002_2D_latlon0.1x0.1.nc
Matched: 2019112100_003_2D_latlon0.1x0.1.nc
Matched: 2019112100_004_2D_latlon0.1x0.1.nc
Matched: 2019112100_005_2D_latlon0.1x0.1.nc
(navigator) nabil@nabil-Sky-X4:~/regex-test$ 
#include <array>
#include <string>
#include <regex>
#include <iostream>

int main() {

    const std::array<std::string, 16> vals {
        "Text",
"2019112100_000_2D_latlon0.1x0.1.nc",
"2019112100_001_2D_latlon0.1x0.1.nc",
"2019112100_002_2D_latlon0.1x0.1.nc",
"2019112100_003_2D_latlon0.1x0.1.nc",
"2019112100_004_2D_latlon0.1x0.1.nc",
"2019112100_005_2D_latlon0.1x0.1.nc",
"2019112100_006_2D_latlon0.1x0.1.nc",
"2019112100_007_2D_latlon0.1x0.1.nc",
"2019112100_008_2D_latlon0.1x0.1.nc",
"2019112100_009_2D_latlon0.1x0.1.nc",
"2019112100_010_2D_latlon0.1x0.1.nc",
"2019112112_045_2D_latlon0.1x0.1.nc",
"2019112112_046_2D_latlon0.1x0.1.nc",
"2019112112_047_2D_latlon0.1x0.1.nc",
"2019112112_048_2D_latlon0.1x0.1.nc"
    };

    try {
        const std::regex r{"^(2019).*(00|06|12|18)_00[0-5].*$", std::regex::optimize | std::regex::ECMAScript};

        for (const auto& v : vals) {
            if (std::regex_match(v, r)) {
                std::cerr << "Matched: " << v << std::endl;
            }
        }
    }
    catch(const std::regex_error& e) {
        std::cerr << "Regex error: " << e.what() << std::endl;
        return -1;
    }
    catch(...) {
        std::cerr << "Caught unknown exception." << std::endl;
        return -1;
    }

    return 0;
}

htmlboss commented 4 years ago

I think it's safe to say there's nothing wrong with the regex and there's something else going on here:

auto crawlDirectory(const std::filesystem::path& inputDirOrIndexFile, const std::string& regex) {
    namespace fs = ::std::filesystem;
    using recursive_dir_iterator = fs::recursive_directory_iterator;

    std::vector<fs::path> paths;
    const auto options{ fs::directory_options::follow_directory_symlink };

    try {
        const std::regex r(regex, std::regex::optimize | std::regex::ECMAScript);

        for (const auto& file : recursive_dir_iterator(inputDirOrIndexFile, options)) {
            if (fs::path(file).extension() == ".nc" && std::regex_match(fs::path(file).string(), r)) {
                paths.emplace_back(file);
            }
        }
    }
    catch(const std::regex_error& e) {
        std::cerr << "Regex error: " << e.what() << std::endl;
        std::exit(EXIT_FAILURE);
    }
    catch(...) {
        std::cerr << "Caught unknown exception." << std::endl;
        std::exit(EXIT_FAILURE);
    }

    return paths;
}

So that leaves it to the recursive_dir_iterator to go over the input directory.
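
One possible explanation, not confirmed in this thread: `std::regex_match` succeeds only when the entire string matches, and `crawlDirectory` passes it `fs::path(file).string()`, which is the full path. A pattern anchored with `^2019` cannot match a string that begins with `/data/...`. The standalone test program, by contrast, matched bare file names. The sketch below contrasts the two; `matchesFullPath` and `matchesFilename` are hypothetical helpers written for this example.

```cpp
#include <filesystem>
#include <regex>
#include <string>

namespace fs = std::filesystem;

// Matches the pattern against the whole path string,
// as the crawlDirectory excerpt above does.
bool matchesFullPath(const fs::path& file, const std::regex& r) {
    return std::regex_match(file.string(), r);
}

// Matches the pattern against the final path component only.
bool matchesFilename(const fs::path& file, const std::regex& r) {
    return std::regex_match(file.filename().string(), r);
}
```

For `/data/eccc_forecasts_netcdf/riops_forecast/2D/LL/2019112100_000_2D_latlon0.1x0.1.nc` and a pattern starting with `^2019`, the first helper returns false and the second returns true. If this is the cause, matching on `filename()` (or using a pattern that allows a leading directory part) would be one way to address it.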

htmlboss commented 4 years ago

Here's a new binary that will print every file it finds in the input directory: test-build.tar.gz. It may flood the console...

dwayne-hart commented 4 years ago

The new binary works as expected and is currently indexing 8673 files.

"/data/eccc_forecasts_netcdf/riops_forecast/2D/LL/2019111206_047_2D_latlon0.1x0.1.nc"
"/data/eccc_forecasts_netcdf/riops_forecast/2D/LL/2019111818_028_2D_latlon0.1x0.1.nc"
"/data/eccc_forecasts_netcdf/riops_forecast/2D/LL/2019103006_027_2D_latlon0.1x0.1.nc"
"/data/eccc_forecasts_netcdf/riops_forecast/2D/LL/2019101900_017_2D_latlon0.1x0.1.nc"
"/data/eccc_forecasts_netcdf/riops_forecast/2D/LL/2019101600_048_2D_latlon0.1x0.1.nc"
Building dataset description from  8673 .nc file(s).

  27% [|||||||||||||                                        ]

If any type of regular expression is used, it is not able to find any files.

dwayne-hart commented 4 years ago

Interestingly, when you feed your latest application a file list, it dumps core.

(navigator) buildadm@u1604-on-production-b:~/ocean-nav/db$ ./nc-timestamp-mapper -n RIOPS-FC-2D-LL -i /data/eccc_forecasts_netcdf/riops_forecast/2D/LL -o /home/buildadm/ocean-nav/db -h --file-list RIOPS-FC-2D-LL.txt
Found list of non-indexed files. Only the files contained in this list will be indexed...
Creating list of all .nc files in "RIOPS-FC-2D-LL.txt"...
Building dataset description from  1239 .nc file(s).

 100% [|||||||||||||||||||||||||||||||||||||||||||||||||||||]

Opening database...
Inserting new values into database...
Segmentation fault (core dumped)