Open dwayne-hart opened 4 years ago
Aside from the question, I would suggest adding a backslash to escape the final period in the expression to correctly capture the file extension:
^2019.*(00|06|12|18)_00[0-5]_2D_latlon0.1x0.1\.nc$
Since . matches any character except line breaks, the unescaped pattern also accepts the literal period. There is no difference to the resulting matches for these file names, but it's a little more explicit.
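To make the difference between the escaped and unescaped period concrete, here is a minimal sketch using std::regex with the ECMAScript grammar. The names matchesName, kLoose, kEscaped and checkDotEscaping are hypothetical, and the "Xnc" file name is made up purely to show what the unescaped period lets through.

```cpp
#include <cassert>
#include <regex>
#include <string>

// Returns true when the whole file name matches the pattern (ECMAScript grammar).
bool matchesName(const std::string& pattern, const std::string& name) {
    const std::regex r(pattern, std::regex::ECMAScript);
    return std::regex_match(name, r);
}

// Pattern from the thread, without and with the final period escaped.
const std::string kLoose   = R"(^2019.*(00|06|12|18)_00[0-5]_2D_latlon0.1x0.1.nc$)";
const std::string kEscaped = R"(^2019.*(00|06|12|18)_00[0-5]_2D_latlon0.1x0.1\.nc$)";

// Small self-check documenting the difference.
void checkDotEscaping() {
    const std::string real  = "2019112100_000_2D_latlon0.1x0.1.nc";
    const std::string bogus = "2019112100_000_2D_latlon0.1x0.1Xnc";  // made-up name
    assert(matchesName(kLoose, real) && matchesName(kEscaped, real));
    assert(matchesName(kLoose, bogus));     // "." accepts the "X"
    assert(!matchesName(kEscaped, bogus));  // "\." requires a literal period
}
```

Calling checkDotEscaping() from a test or from main exercises both patterns. For the real file names in this directory both patterns select the same files, which is why the escape is only a matter of explicitness here.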
I had tried that last night as well. The application did not see anything.
(navigator) buildadm@u1604-on-production-b:~/ocean-nav/db$ ./nc-timestamp-mapper -n RIOPS-FC-2D-LL -i /data/eccc_forecasts_netcdf/riops_forecast/2D/LL -o /home/buildadm/ocean-nav/db -r ^2019112100_000_2D_latlon0.1x0.1\.nc --dry-run -h
---DRY RUN---
List of non-indexed files not found. Continuing with complete indexing operation...
Creating list of all .nc files in "/data/eccc_forecasts_netcdf/riops_forecast/2D/LL/"...
No .nc files found.
Exiting...
I think it's an issue with the expression itself. What are the criteria it's supposed to match?
Try this:
^(2019).*(00|06|12|18)_00[0-5].*$
We wish to have all of 2019 for runs 00, 06, 12, and 18, with times 000 through 005.
(navigator) buildadm@u1604-on-production-b:~/ocean-nav/db$ ls /data/eccc_forecasts_netcdf/riops_forecast/2D/LL/ | grep -P "^2019.*(00|06|12|18)_00[0-5]_2D_latlon0.1x0.1.nc$" | tail -10
2019112200_002_2D_latlon0.1x0.1.nc
2019112200_003_2D_latlon0.1x0.1.nc
2019112200_004_2D_latlon0.1x0.1.nc
2019112200_005_2D_latlon0.1x0.1.nc
2019112206_000_2D_latlon0.1x0.1.nc
2019112206_001_2D_latlon0.1x0.1.nc
2019112206_002_2D_latlon0.1x0.1.nc
2019112206_003_2D_latlon0.1x0.1.nc
2019112206_004_2D_latlon0.1x0.1.nc
2019112206_005_2D_latlon0.1x0.1.nc
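The criteria stated above (all of 2019, runs 00, 06, 12 and 18, times 000 to 005) can also be written as a stricter pattern with capture groups. The sketch below assumes the file names follow a YYYYMMDDHH_LLL_2D_latlon0.1x0.1.nc layout, which is inferred from the listing and not confirmed anywhere in the thread. ForecastName, parseForecastName and checkParseForecastName are hypothetical names.

```cpp
#include <cassert>
#include <optional>
#include <regex>
#include <string>

// Hypothetical helper type: the pieces of a forecast file name.
struct ForecastName {
    std::string date;  // YYYYMMDD
    std::string run;   // 00, 06, 12 or 18
    std::string lead;  // 000 .. 005
};

// Returns the parsed pieces, or std::nullopt when the name does not meet the criteria.
std::optional<ForecastName> parseForecastName(const std::string& name) {
    static const std::regex r(
        R"(^(2019\d{4})(00|06|12|18)_(00[0-5])_2D_latlon0\.1x0\.1\.nc$)",
        std::regex::ECMAScript);
    std::smatch m;
    if (!std::regex_match(name, m, r)) return std::nullopt;
    return ForecastName{m[1].str(), m[2].str(), m[3].str()};
}

// Small self-check using names from the listing above.
void checkParseForecastName() {
    const auto ok = parseForecastName("2019112206_003_2D_latlon0.1x0.1.nc");
    assert(ok && ok->date == "20191122" && ok->run == "06" && ok->lead == "003");
    assert(!parseForecastName("2019112206_006_2D_latlon0.1x0.1.nc"));  // time out of range
}
```

Compared with the looser .* form, this version pins the date to exactly four digits after 2019, so the run and time fields cannot drift to another position in the name.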
Here's a test binary with the regex parser set to use the ECMAScript grammar and to optimize the given expression. See if this changes anything.
The regular expression you wished to try did not work...
(navigator) buildadm@u1604-on-production-b:~/ocean-nav/db$ ./nc-timestamp-mapper -n RIOPS-FC-2D-LL -i /data/eccc_forecasts_netcdf/riops_forecast/2D/LL/ -o /home/buildadm/ocean-nav/db -r ^(2019).*(00|06|12|18)_00[0-5].*$ --dry-run -h
No joy with the new binary.
(navigator) buildadm@u1604-on-production-b:~/ocean-nav/db$ ./nc-timestamp-mapper -n RIOPS-FC-2D-LL -i /data/eccc_forecasts_netcdf/riops_forecast/2D/LL/ -o /home/buildadm/ocean-nav/db -r ^(2019).*(00|06|12|18)_00[0-5].*$ --dry-run -h
-su: syntax error near unexpected token `('
(navigator) buildadm@u1604-on-production-b:~/ocean-nav/db$ ./nc-timestamp-mapper -n RIOPS-FC-2D-LL -i /data/eccc_forecasts_netcdf/riops_forecast/2D/LL -o /home/buildadm/ocean-nav/db -r ^2019112100_000_2D_latlon0.1x0.1\.nc --dry-run -h
---DRY RUN---
List of non-indexed files not found. Continuing with complete indexing operation...
Creating list of all .nc files in "/data/eccc_forecasts_netcdf/riops_forecast/2D/LL/"...
No .nc files found.
Exiting...
(navigator) buildadm@u1604-on-production-b:~/ocean-nav/db$ ./nc-timestamp-mapper -n RIOPS-FC-2D-LL -i /data/eccc_forecasts_netcdf/riops_forecast/2D/LL -o /home/buildadm/ocean-nav/db -r ^2019112100_000_2D_latlon0.1x0.1.nc --dry-run -h
---DRY RUN---
List of non-indexed files not found. Continuing with complete indexing operation...
Creating list of all .nc files in "/data/eccc_forecasts_netcdf/riops_forecast/2D/LL/"...
No .nc files found.
Exiting...
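A note on the commands above: the expression is passed without quotes, so the shell interprets it before the program sees it. Unquoted parentheses are shell syntax, which is what the "syntax error near unexpected token" message reports, and an unquoted \. reaches the program as a plain period. Wrapping the whole expression in single quotes avoids both. One way to confirm what actually arrives is to have the program echo the pattern before compiling it. The sketch below is only an illustration of that idea; logPattern and checkLogPattern are hypothetical names and are not part of nc-timestamp-mapper.

```cpp
#include <cassert>
#include <ostream>
#include <sstream>
#include <string>

// Writes the pattern between brackets so stray or missing characters are visible.
void logPattern(std::ostream& os, const std::string& pattern) {
    os << "regex received: [" << pattern << "] (" << pattern.size() << " chars)\n";
}

// What the program would log for two spellings of the same argument.
void checkLogPattern() {
    std::ostringstream unquoted, quoted;
    logPattern(unquoted, "0.1.nc");    // typed as 0.1\.nc without quotes
    logPattern(quoted, R"(0.1\.nc)");  // typed as '0.1\.nc' in single quotes
    assert(unquoted.str() == "regex received: [0.1.nc] (6 chars)\n");
    assert(quoted.str() == "regex received: [0.1\\.nc] (7 chars)\n");
}
```

In a real tool the call would sit right where the -r argument is read, writing to std::cerr, so a dry run shows the exact string handed to std::regex.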
Well, something strange is going on, since I just wrote a test program and my expression works fine:
(navigator) nabil@nabil-Sky-X4:~/regex-test$ g++-9 -std=c++2a -Wall -Wextra -march=native -O3 -pedantic -Wshadow -o regex-tester main.cpp
(navigator) nabil@nabil-Sky-X4:~/regex-test$ ./regex-tester
Matched: 2019112100_000_2D_latlon0.1x0.1.nc
Matched: 2019112100_001_2D_latlon0.1x0.1.nc
Matched: 2019112100_002_2D_latlon0.1x0.1.nc
Matched: 2019112100_003_2D_latlon0.1x0.1.nc
Matched: 2019112100_004_2D_latlon0.1x0.1.nc
Matched: 2019112100_005_2D_latlon0.1x0.1.nc
(navigator) nabil@nabil-Sky-X4:~/regex-test$
#include <array>
#include <string>
#include <regex>
#include <iostream>

int main() {
    const std::array<std::string, 16> vals {
        "Text",
        "2019112100_000_2D_latlon0.1x0.1.nc",
        "2019112100_001_2D_latlon0.1x0.1.nc",
        "2019112100_002_2D_latlon0.1x0.1.nc",
        "2019112100_003_2D_latlon0.1x0.1.nc",
        "2019112100_004_2D_latlon0.1x0.1.nc",
        "2019112100_005_2D_latlon0.1x0.1.nc",
        "2019112100_006_2D_latlon0.1x0.1.nc",
        "2019112100_007_2D_latlon0.1x0.1.nc",
        "2019112100_008_2D_latlon0.1x0.1.nc",
        "2019112100_009_2D_latlon0.1x0.1.nc",
        "2019112100_010_2D_latlon0.1x0.1.nc",
        "2019112112_045_2D_latlon0.1x0.1.nc",
        "2019112112_046_2D_latlon0.1x0.1.nc",
        "2019112112_047_2D_latlon0.1x0.1.nc",
        "2019112112_048_2D_latlon0.1x0.1.nc"
    };

    try {
        const std::regex r{"^(2019).*(00|06|12|18)_00[0-5].*$", std::regex::optimize | std::regex::ECMAScript};
        for (const auto& v : vals) {
            if (std::regex_match(v, r)) {
                std::cerr << "Matched: " << v << std::endl;
            }
        }
    }
    catch(const std::regex_error& e) {
        std::cerr << "Regex error: " << e.what() << std::endl;
        return -1;
    }
    catch(...) {
        std::cerr << "Caught unknown exception." << std::endl;
        return -1;
    }

    return 0;
}
I think it's safe to say there's nothing wrong with the regex and there's something else going on here:
auto crawlDirectory(const std::filesystem::path& inputDirOrIndexFile, const std::string& regex) {
    namespace fs = ::std::filesystem;
    using recursive_dir_iterator = fs::recursive_directory_iterator;

    std::vector<fs::path> paths;
    const auto options{ fs::directory_options::follow_directory_symlink };

    try {
        const std::regex r(regex, std::regex::optimize | std::regex::ECMAScript);
        for (const auto& file : recursive_dir_iterator(inputDirOrIndexFile, options)) {
            if (fs::path(file).extension() == ".nc" && std::regex_match(fs::path(file).string(), r)) {
                paths.emplace_back(file);
            }
        }
    }
    catch(const std::regex_error& e) {
        std::cerr << "Regex error: " << e.what() << std::endl;
        std::exit(EXIT_FAILURE);
    }
    catch(...) {
        std::cerr << "Caught unknown exception." << std::endl;
        std::exit(EXIT_FAILURE);
    }

    return paths;
}
So that leaves the recursive_dir_iterator that goes over the input directory.
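One possible explanation, which the thread does not confirm: crawlDirectory passes fs::path(file).string() to std::regex_match, and for entries produced by the directory iterator that string is the full path (for example /data/eccc_forecasts_netcdf/riops_forecast/2D/LL/2019112100_000_2D_latlon0.1x0.1.nc). std::regex_match only succeeds when the entire string matches, so an expression beginning with ^2019 cannot match a string that begins with /data. The standalone test program matched bare file names, which would explain why it behaved differently. The sketch below contrasts the two checks. matchesFileName, matchesFullPath and checkPathMatching are hypothetical names, and matching on filename() is an assumed remedy, not the fix that was actually shipped.

```cpp
#include <cassert>
#include <filesystem>
#include <regex>
#include <string>

namespace fs = std::filesystem;

// Tests the regex against the file name only, not the directory part.
bool matchesFileName(const fs::path& file, const std::regex& r) {
    return file.extension() == ".nc" &&
           std::regex_match(file.filename().string(), r);
}

// Same test as in crawlDirectory: the whole path string is matched.
bool matchesFullPath(const fs::path& file, const std::regex& r) {
    return file.extension() == ".nc" && std::regex_match(file.string(), r);
}

// Small self-check with a path taken from the thread.
void checkPathMatching() {
    const std::regex r(R"(^(2019).*(00|06|12|18)_00[0-5].*$)",
                       std::regex::optimize | std::regex::ECMAScript);
    const fs::path file =
        "/data/eccc_forecasts_netcdf/riops_forecast/2D/LL/2019112100_000_2D_latlon0.1x0.1.nc";
    assert(!matchesFullPath(file, r));  // string starts with "/data", not "2019"
    assert(matchesFileName(file, r));   // matches "2019112100_000_2D_latlon0.1x0.1.nc"
}
```

An alternative would be to keep matching the full path and require users to write expressions such as .*/2019.*, but matching on the file name keeps the command-line expression identical to what works with grep on the ls output.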
Here's a new binary that will print every file it finds in the input directory: test-build.tar.gz. It may flood the console...
The new binary works as expected and is currently indexing 8673 files.
"/data/eccc_forecasts_netcdf/riops_forecast/2D/LL/2019111206_047_2D_latlon0.1x0.1.nc"
"/data/eccc_forecasts_netcdf/riops_forecast/2D/LL/2019111818_028_2D_latlon0.1x0.1.nc"
"/data/eccc_forecasts_netcdf/riops_forecast/2D/LL/2019103006_027_2D_latlon0.1x0.1.nc"
"/data/eccc_forecasts_netcdf/riops_forecast/2D/LL/2019101900_017_2D_latlon0.1x0.1.nc"
"/data/eccc_forecasts_netcdf/riops_forecast/2D/LL/2019101600_048_2D_latlon0.1x0.1.nc"
Building dataset description from 8673 .nc file(s).
27% [||||||||||||| ]
If any type of regular expression is used, it is not able to find any files.
Interesting that when you feed your latest application a file-list it drops a core.
(navigator) buildadm@u1604-on-production-b:~/ocean-nav/db$ ./nc-timestamp-mapper -n RIOPS-FC-2D-LL -i /data/eccc_forecasts_netcdf/riops_forecast/2D/LL -o /home/buildadm/ocean-nav/db -h --file-list RIOPS-FC-2D-LL.txt
Found list of non-indexed files. Only the files contained in this list will be indexed...
Creating list of all .nc files in "RIOPS-FC-2D-LL.txt"...
Building dataset description from 1239 .nc file(s).
100% [|||||||||||||||||||||||||||||||||||||||||||||||||||||]
Opening database...
Inserting new values into database...
Segmentation fault (core dumped)
I tested the regex by using the https://regexr.com site.
Expression: ^2019.*(00|06|12|18)_00[0-5]_2D_latlon0.1x0.1.nc$
The regexr site said that it had found 18 entries for this sample data set.