Open CMCDragonkai opened 6 years ago
Yes, and in both cases you get the file with `fetchurl` and put it where `make` expects it so it doesn't need to be refetched. I guess you could find yet another way to structure this; I think for a single-use fetch it is more typical to put the `fetchurl` invocation in the same `.nix` file.
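A minimal sketch of that pattern, assuming a hypothetical package whose Makefile expects `data.tar.gz` in the build directory. The URL, file name, and package name are placeholders, and `lib.fakeSha256` stands in for the real hash:

```nix
{ lib, stdenv, fetchurl }:

let
  # Hypothetical data file; replace the URL and hash with the real ones.
  dataFile = fetchurl {
    url = "https://example.org/data.tar.gz";
    sha256 = lib.fakeSha256; # placeholder; the build fails and reports the actual hash
  };
in
stdenv.mkDerivation {
  pname = "example";
  version = "0.0.0";
  src = ./.;

  # Copy the prefetched file to where `make` would otherwise download it,
  # so the build itself never needs network access.
  preBuild = ''
    cp ${dataFile} data.tar.gz
  '';
}
```

The Makefile may still need patching so that it skips the download when the file is already present; that depends on the upstream build and is not shown here.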
Note that the generated data files are meant to be updatable: https://github.com/openvenues/libpostal#data-files
Does that mean the build expression needs to put these files in a temporary location? How would one do this? Do we always expect that there's a mutable home directory with relevant data files? Or do we just fix them just like the built source code?
I think the ideal case would be to build most of the code, then provide a wrapper that uses prebuilt code and a fixed predownloaded dataset to provide the functionality. If libpostal builds quickly, maybe just fixing the dataset and rebuilding every time is also fine.
@7c6f434c Can you let me know whether `pkgconfig` will be needed in `nativeBuildInputs` for this particular package? It seems that only downstream packages would make use of it. Building `libpostal` itself doesn't require `pkgconfig`.
What is the output of `fetchurl` in this case, if I were to use it on the data files? Here is where they specify the data download: https://github.com/openvenues/libpostal/blob/master/src/libpostal_data
Note that the only output binary that `libpostal` creates is `libpostal_data`, which is used to update or download new data files. To me, this means that the data files should live in a mutable location. But other than $HOME, where else can mutable data files exist on NixOS?
Alternatively, should `libpostal_data` just be removed from `$out/bin`?
At what moment is the data location fixed? I guess if a user can use `libpostal_data` to download data to some place and tell `libpostal` to use this place, then the tool can stay.
No idea whether `pkgconfig` is necessary or not; but if there are no problems without it, then indeed it is not needed as an input.
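As a rough illustration of the distinction being discussed: tools that only run at build time go in `nativeBuildInputs`, and a downstream package that locates libpostal through pkg-config would be the one to list it. The `libpostal` attribute and the consumer package below are hypothetical:

```nix
{ stdenv, pkgconfig, libpostal }:

# Hypothetical downstream consumer of libpostal; names are illustrative only.
stdenv.mkDerivation {
  pname = "libpostal-consumer";
  version = "0.0.0";
  src = ./.;

  # pkg-config runs on the build machine, so it is a native build input.
  # (Newer nixpkgs revisions spell this attribute `pkg-config`.)
  nativeBuildInputs = [ pkgconfig ];

  # The library that the consumer links against.
  buildInputs = [ libpostal ];
}
```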
In theory the location is configurable at both build time and runtime. `libpostal_setup_datadir` will use the path configured as datadir if passed NULL, and otherwise use whatever string is passed.
But in practice it seems it's really the build-time path that matters: there's a method `libpostal_setup` that calls `libpostal_setup_datadir(NULL)`, and this is the method used by e.g. the Python bindings.
The libpostal data files aren't really mutable though; there's only one recent released version and the code opens them read-only. I'd package the data files as a fixed-output derivation that runs the downloader script. Then the data dir can be configured as that output path.
OK, in that case I'll just make sure to remove `libpostal_data`, simply because it won't work here, as the store path containing the data files will be immutable.
OK, so I'm getting the idea of using a fixed-output derivation for the data files, in a similar vein to using `fetchurl`. However, I'm looking at using `libpostal_data` as the build.sh. The problem is that this script appears non-deterministic. Look at:
Furthermore, the output of this script isn't just `tar.gz` files; it also extracts them into a particular directory structure. How do you specify the fixed output hash for this?
They create a structure like this:
├── address_expansions
│   └── address_dictionary.dat
├── address_parser
│   ├── address_parser_crf.dat
│   ├── address_parser_phrases.dat
│   ├── address_parser_postal_codes.dat
│   └── address_parser_vocab.trie
├── data_version
├── language_classifier
│   └── language_classifier.dat
├── last_updated
├── last_updated_language_classifier
├── last_updated_parser
├── numex
│   └── numex.dat
└── transliteration
    └── transliteration.dat

5 directories, 12 files
The `last_updated*` files are created according to the current time.
Instead of using their download script, it seems I should be able to acquire the data files directly.
let
  version = "v1.0.0";
  classifierData = fetchTarball "https://github.com/openvenues/libpostal/releases/download/${version}/language_classifier.tar.gz";
  libpostalData = fetchTarball "https://github.com/openvenues/libpostal/releases/download/${version}/libpostal_data.tar.gz";
  parserData = fetchTarball "https://github.com/openvenues/libpostal/releases/download/${version}/parser.tar.gz";
in
  ...
However, these 3 paths are meant to be composed into a single path with a directory structure similar to the one above. The `last_updated*` files shouldn't be necessary. The above expression would fetch the 3 archives into 3 separate store paths. What's the right way to do this?
Also I hope `fetchTarball` supports redirects. Or... https://github.com/NixOS/nix/issues/520
Or I just go with a fixed-output derivation like `fetchurl` but write my own build.sh that does the above.
A custom fixed-output derivation is probably the best option.
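A rough, untested sketch of what such a derivation could look like. It assumes that the three release archives unpack into the directory names shown in the tree above, and it uses `lib.fakeSha256` as a placeholder hash. Setting `outputHashMode = "recursive"` is what lets the hash cover a whole directory rather than a single file:

```nix
{ lib, stdenv, curl, cacert }:

stdenv.mkDerivation {
  name = "libpostal-data-v1.0.0";

  nativeBuildInputs = [ curl ];

  # Fixed-output derivations may use the network, but curl still needs CA certificates.
  SSL_CERT_FILE = "${cacert}/etc/ssl/certs/ca-bundle.crt";

  # Hash the whole output directory (NAR serialisation) instead of one flat file.
  outputHashMode = "recursive";
  outputHashAlgo = "sha256";
  outputHash = lib.fakeSha256; # placeholder; replace with the hash Nix reports

  buildCommand = ''
    mkdir -p $out
    base=https://github.com/openvenues/libpostal/releases/download/v1.0.0
    for archive in language_classifier libpostal_data parser; do
      # -L follows redirects; -f makes HTTP errors fail the build.
      curl -fL "$base/$archive.tar.gz" | tar -xz -C $out
    done
  '';
}
```

Because this skips the upstream download script, no `last_updated*` timestamp files are written, so the output should stay the same between builds as long as the release archives do not change.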
@CMCDragonkai I've constructed a fixed output derivation here for the libpostal address data:
parserTarball = nixpkgs.fetchzip {
  url = "https://github.com/openvenues/libpostal/releases/download/v1.0.0/parser.tar.gz";
  sha256 = "193fk4x0j9jwvkcva5rir3zw8nhf994q40xyv59da6mlfxpi6w9q";
  stripRoot = false;
};
along with another derivation to wrap that
libpostalData = nixpkgs.stdenv.mkDerivation {
  name = "libpostal-data";
  buildCommand = ''
    mkdir -p $out/data
    ln -s ${parserTarball}/address_parser $out/data/address_parser
    ln -s ${addressBase}/address_parser/transliteration $out/data/transliteration
    ln -s ${addressBase}/address_parser/numex $out/data/numex
    ln -s ${addressBase}/address_parser/address_expansions $out/data/address_expansions
  '';
};
and finally the derivation to build libpostal
libpostalc =
  nixpkgs.stdenv.mkDerivation {
    name = "libpostal";
    src = nixpkgs.fetchFromGitHub {
      owner = "openvenues";
      repo = "libpostal";
      rev = "43795a3d903991d3864926393af10c3ec31a161c";
      sha256 = "0i6ij3rjlr4zdf3yz83yadnw8cbgswhmd3717nv0br93hahjyd16";
    };
    buildInputs = with nixpkgs; [ autoreconfHook curl ];
    configureFlags = [
      "--datadir=${libpostalData}/data"
      "--disable-data-download"
    ];
  };
I'm getting exceptions though when I attempt to use it: the data directory can't be found, which is odd, since I'm specifying the directory as a configure flag (see above) and telling `libpostal` where the data lives.
at libpostal_setup_datadir (libpostal.c:266) errno: No such file or directory
ERR Could not find parser model file of known type
at address_parser_load (address_parser.c:215) errno: No such file or directory
ERR Error loading address parser module, dir=(null)
at libpostal_setup_parser_datadir (libpostal.c:410) errno: No such file or directory
ERR Error loading transliteration module, dir=(null)
at libpostal_setup_datadir (libpostal.c:266) errno: No such file or directory
ERR Could not find parser model file of known type
at address_parser_load (address_parser.c:215) errno: No such file or directory
ERR Error loading address parser module, dir=(null)
at libpostal_setup_parser_datadir (libpostal.c:410) errno: No such file or directory
ERR Error loading transliteration module, dir=(null)
at libpostal_setup_datadir (libpostal.c:266) errno: No such file or directory
ERR Could not find parser model file of known type
at address_parser_load (address_parser.c:215) errno: No such file or directory
ERR Error loading address parser module, dir=(null)
at libpostal_setup_parser_datadir (libpostal.c:410) errno: No such file or directory
ERR Error loading transliteration module, dir=(null)
at libpostal_setup_datadir (libpostal.c:266) errno: No such file or directory
ERR Could not find parser model file of known type
at address_parser_load (address_parser.c:215) errno: No such file or directory
ERR Error loading address parser module, dir=(null)
at libpostal_setup_parser_datadir (libpostal.c:410) errno: No such file or directory
ERR Error loading transliteration module, dir=(null)
at libpostal_setup_datadir (libpostal.c:266) errno: No such file or directory
ERR Could not find parser model file of known type
at address_parser_load (address_parser.c:215) errno: No such file or directory
ERR Error loading address parser module, dir=(null)
at libpostal_setup_parser_datadir (libpostal.c:410) errno: No such file or directory
cc @7c6f434c @CMCDragonkai @albarrentine any thoughts?
What even is `addressBase`?
@7c6f434c
addressBase = nixpkgs.fetchzip {
  url = "https://github.com/openvenues/libpostal/releases/download/v1.0.0/libpostal_data.tar.gz";
  sha256 = "1hbckdqizhzznbsfgp5y2b8p074bw97kn766sfmkqmv18j98548n";
  stripRoot = false;
};
I think you have an extra `address_parser` in your symlinks into `addressBase`.
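If that is the case, the wrapper derivation might look like the sketch below, with the `address_parser/` component dropped from the `addressBase` links. This is untested and assumes that `libpostal_data.tar.gz` unpacks `transliteration`, `numex`, and `address_expansions` at its root. The URLs and hashes are the ones posted in this thread:

```nix
{ nixpkgs ? import <nixpkgs> { } }:

let
  parserTarball = nixpkgs.fetchzip {
    url = "https://github.com/openvenues/libpostal/releases/download/v1.0.0/parser.tar.gz";
    sha256 = "193fk4x0j9jwvkcva5rir3zw8nhf994q40xyv59da6mlfxpi6w9q";
    stripRoot = false;
  };

  addressBase = nixpkgs.fetchzip {
    url = "https://github.com/openvenues/libpostal/releases/download/v1.0.0/libpostal_data.tar.gz";
    sha256 = "1hbckdqizhzznbsfgp5y2b8p074bw97kn766sfmkqmv18j98548n";
    stripRoot = false;
  };
in
nixpkgs.stdenv.mkDerivation {
  name = "libpostal-data";
  buildCommand = ''
    mkdir -p $out/data
    ln -s ${parserTarball}/address_parser $out/data/address_parser
    # Assumption: these directories sit at the archive root, not under address_parser/.
    ln -s ${addressBase}/transliteration $out/data/transliteration
    ln -s ${addressBase}/numex $out/data/numex
    ln -s ${addressBase}/address_expansions $out/data/address_expansions
  '';
}
```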
@7c6f434c figured it out, I wasn't using the right C function. As for the data, I ran `./libpostal_data download all datadir`, then tar'd the result, hosted it somewhere that won't change, and fetched it separately with Nix. Then I passed the path of the fetched data dir to the C function that takes a data dir.
@dmjio Can you construct a PR for the work you have done and link it to this issue?
Issue description
I'm trying to package up https://github.com/openvenues/libpostal, which uses `curl` during its make to download an external file. Does this mean that when we turn this into a Nix expression, we can either A: package the external file as a Nix dependency and change the libpostal expression to rely on that external file expression, or B: fetch that external file with an extra `fetchurl` call within the same libpostal expression?
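For comparison, option A could be sketched roughly as follows: the data lives in its own expression and is passed to the libpostal expression as an ordinary argument. The `libpostalData` argument is hypothetical, the source revision, hash, and configure flags are the ones quoted in this thread, and whether the data directory layout matches what libpostal expects has not been verified here:

```nix
# libpostal.nix: option A, the data is a separate dependency.
{ stdenv, fetchFromGitHub, autoreconfHook, curl
, libpostalData # hypothetical: a derivation containing the unpacked data files
}:

stdenv.mkDerivation {
  name = "libpostal";

  src = fetchFromGitHub {
    owner = "openvenues";
    repo = "libpostal";
    rev = "43795a3d903991d3864926393af10c3ec31a161c";
    sha256 = "0i6ij3rjlr4zdf3yz83yadnw8cbgswhmd3717nv0br93hahjyd16";
  };

  nativeBuildInputs = [ autoreconfHook ];
  buildInputs = [ curl ];

  # Point the build at the prefetched data and skip the download during make.
  configureFlags = [
    "--datadir=${libpostalData}"
    "--disable-data-download"
  ];
}
```

Option B differs only in where the fetch is written: the `fetchurl` (or `fetchzip`) call would sit in a `let` binding inside this same file instead of arriving as an argument.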