NixOS / nixpkgs

Packaging up libpostal #38024

Open CMCDragonkai opened 6 years ago

CMCDragonkai commented 6 years ago

Issue description

I'm trying to package up https://github.com/openvenues/libpostal, which uses curl during its make, to download an external file.

Does this mean that when we turn this into a Nix expression, we can either (A) package the external file as a separate Nix dependency that the libpostal expression relies on, or (B) fetch the external file with an extra fetchurl call within the same libpostal expression?

7c6f434c commented 6 years ago

Yes, and in both cases you get the file with fetchurl and put it where make expects it, so it doesn't need to be refetched. I guess you could find yet another way to structure this; for a single, single-use fetch it is more typical to put the fetchurl invocation in the same .nix file.
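
A minimal sketch of option B, assuming the build's curl step can be bypassed by copying a prefetched file into place before make runs. The URL, the placeholder hash, and the destination `data/external-file.tar.gz` are hypothetical; the real values depend on what libpostal's Makefile actually downloads and where it expects the file. The `src` pin is the one quoted in dmjio's comment in this thread.

```nix
{ stdenv, lib, fetchurl, fetchFromGitHub, autoreconfHook }:

let
  # Hypothetical URL: replace with the file the Makefile really downloads.
  externalFile = fetchurl {
    url = "https://example.org/external-file.tar.gz";
    # Placeholder hash: the first build fails and prints the real one.
    sha256 = lib.fakeSha256;
  };
in
stdenv.mkDerivation {
  name = "libpostal";

  src = fetchFromGitHub {
    owner = "openvenues";
    repo = "libpostal";
    rev = "43795a3d903991d3864926393af10c3ec31a161c";
    sha256 = "0i6ij3rjlr4zdf3yz83yadnw8cbgswhmd3717nv0br93hahjyd16";
  };

  nativeBuildInputs = [ autoreconfHook ];

  # Put the prefetched file where make is assumed to look for it
  # (hypothetical path), so nothing has to be downloaded during the build.
  preBuild = ''
    mkdir -p data
    cp ${externalFile} data/external-file.tar.gz
  '';
}
```

Keeping the fetchurl call in the same file keeps the pin next to its only consumer; splitting it into its own expression would only pay off if several packages shared the file.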

CMCDragonkai commented 6 years ago

Note that data files generated are meant to be able to be updated: https://github.com/openvenues/libpostal#data-files

Does that mean the build expression needs to put these files in a temporary location? How would one do this? Do we always expect that there's a mutable home directory with relevant data files? Or do we just fix them just like the built source code?

7c6f434c commented 6 years ago

I think the ideal case would be to build most of the code, then provide a wrapper that uses prebuilt code and a fixed predownloaded dataset to provide the functionality. If libpostal builds quickly, maybe just fixing the dataset and rebuilding every time is also fine.

CMCDragonkai commented 6 years ago

https://github.com/openvenues/libpostal/issues/341

CMCDragonkai commented 6 years ago

@7c6f434c Can you let me know whether pkgconfig will be needed in nativeBuildInputs for this particular package? It seems that only downstream packages would make use of it. The building of libpostal itself doesn't require pkgconfig.

CMCDragonkai commented 6 years ago

What is the output of fetchurl in this case, if I were to use it on the data files? Here is the script they use to specify the data download: https://github.com/openvenues/libpostal/blob/master/src/libpostal_data

Note that the only output binary that libpostal creates is libpostal_data, which is used to update or download new data files. I feel like this means that the data files should live in a mutable location. But other than $HOME, where else can mutable data files exist on NixOS?

Alternatively, should libpostal_data then just be removed from $out/bin?

7c6f434c commented 6 years ago

At what moment is the data location fixed? I guess if a user can use libpostal_data to download data to some place and tell libpostal to use this place, then the tool can stay.

No idea whether pkgconfig is necessary or not; but if there are no problems without it, then indeed it is not needed as an input.

Mathnerd314 commented 6 years ago

In theory the location is configurable at both build time and runtime. libpostal_setup_datadir will use the path configured as datadir if passed NULL and otherwise use whatever string is passed.

But in practice it seems it's really the build-time path that matters: there's a method libpostal_setup that calls libpostal_setup_datadir(NULL), and this is the method used by e.g. the Python bindings.

The libpostal data files aren't really mutable though; there's only one recent released version and the code opens them read-only. I'd package the data files as a fixed-output derivation that runs the downloader script. Then the data dir can be configured as that output path.

CMCDragonkai commented 6 years ago

Ok in that case I'll just make sure to remove libpostal_data simply because it won't work in this case, as the store path containing the data files will be immutable.
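
One way to drop the tool, sketched under the assumption that `make install` places it at `$out/bin/libpostal_data`. The `--disable-data-download` flag and the `src` pin are taken from dmjio's snippet in this thread, not verified against upstream's configure script.

```nix
{ stdenv, fetchFromGitHub, autoreconfHook }:

stdenv.mkDerivation {
  name = "libpostal";

  src = fetchFromGitHub {
    owner = "openvenues";
    repo = "libpostal";
    rev = "43795a3d903991d3864926393af10c3ec31a161c";
    sha256 = "0i6ij3rjlr4zdf3yz83yadnw8cbgswhmd3717nv0br93hahjyd16";
  };

  nativeBuildInputs = [ autoreconfHook ];

  # Flag as used elsewhere in this thread; skips the download during make.
  configureFlags = [ "--disable-data-download" ];

  # The updater cannot write into an immutable store path, so remove it.
  # Assumes the install step puts it in $out/bin.
  postInstall = ''
    rm -f $out/bin/libpostal_data
  '';
}
```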

CMCDragonkai commented 6 years ago

Ok so I'm getting the idea of using a fixed-output derivation for the data files, in a similar vein to using fetchurl. However I'm looking at using libpostal_data as the build.sh. The problem is that this script appears non-deterministic. Look at:

https://github.com/openvenues/libpostal/blob/027fbc5afc3d825aeab04e4de79b0363b437deec/src/libpostal_data#L122-L123

Furthermore, this script doesn't just output tar.gz files; it extracts them into a particular directory structure. How do you specify the fixed output hash for this?

They create a structure like this:

├── address_expansions
│   └── address_dictionary.dat
├── address_parser
│   ├── address_parser_crf.dat
│   ├── address_parser_phrases.dat
│   ├── address_parser_postal_codes.dat
│   └── address_parser_vocab.trie
├── data_version
├── language_classifier
│   └── language_classifier.dat
├── last_updated
├── last_updated_language_classifier
├── last_updated_parser
├── numex
│   └── numex.dat
└── transliteration
    └── transliteration.dat

5 directories, 12 files

The last_updated* files are being created according to the current time.

CMCDragonkai commented 6 years ago

Instead of using their download script, it seems I should be able to acquire the data files directly.

let
  version = "v1.0.0";
  classifierData = fetchTarball "https://github.com/openvenues/libpostal/releases/download/${version}/language_classifier.tar.gz";
  libpostalData = fetchTarball "https://github.com/openvenues/libpostal/releases/download/${version}/libpostal_data.tar.gz";
  parserData = fetchTarball "https://github.com/openvenues/libpostal/releases/download/${version}/parser.tar.gz";
in
  ...

However, these 3 paths are meant to be composed into a single path with a directory structure similar to the one above. The last_updated* files shouldn't be necessary. The above expression would fetch the 3 tarballs into 3 separate store paths. What's the right way to do this?
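
One possible way to merge the three store paths is nixpkgs' `symlinkJoin`, sketched here under two assumptions: each archive unpacks to top-level directories such as `address_parser/` (hence `stripRoot = false`), and the hashes below are placeholders that must be replaced with the values Nix reports. The helper name `fetchData` is made up for this example.

```nix
{ lib, fetchzip, symlinkJoin }:

let
  version = "v1.0.0";

  # Hypothetical helper: fetch and unpack one release archive, pinned by hash.
  fetchData = archive: sha256: fetchzip {
    inherit sha256;
    url = "https://github.com/openvenues/libpostal/releases/download/${version}/${archive}.tar.gz";
    # Keep the archive's top-level directories instead of stripping one level.
    stripRoot = false;
  };
in
symlinkJoin {
  name = "libpostal-data-${version}";
  # Placeholder hashes: each fetch fails once and prints the real hash.
  paths = [
    (fetchData "language_classifier" lib.fakeSha256)
    (fetchData "libpostal_data" lib.fakeSha256)
    (fetchData "parser" lib.fakeSha256)
  ];
}
```

The result is a single store path containing symlinks into the three unpacked archives, so no `last_updated*` files are created at all.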

Also I hope fetchTarball supports redirects. Or... https://github.com/NixOS/nix/issues/520

Or I just go with the fixed output derivation like fetchurl but write my own build.sh that does the above.

7c6f434c commented 6 years ago

A custom fixed-output derivation is probably the best option.
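
A rough sketch of what such a derivation could look like, assuming the three release archives named earlier in the thread unpack directly into the expected layout. The `outputHash` is a placeholder, and the loop variable `archive` is just a name chosen for this example.

```nix
{ stdenv, lib, curl, cacert }:

stdenv.mkDerivation rec {
  pname = "libpostal-data";
  version = "1.0.0";

  nativeBuildInputs = [ curl ];

  # Fixed-output derivations are allowed network access; the result is
  # checked against outputHash below. Unlike the upstream script, no
  # last_updated* timestamp files are written, so the output is stable.
  buildCommand = ''
    mkdir -p $out
    for archive in language_classifier libpostal_data parser; do
      curl --fail --location \
        --cacert ${cacert}/etc/ssl/certs/ca-bundle.crt \
        "https://github.com/openvenues/libpostal/releases/download/v${version}/$archive.tar.gz" \
        | tar -xz -C $out
    done
  '';

  # "recursive" hashes the whole output directory tree, not a single file.
  outputHashMode = "recursive";
  outputHashAlgo = "sha256";
  # Placeholder: the first build fails and reports the actual hash.
  outputHash = lib.fakeSha256;
}
```

`outputHashMode = "recursive"` is what lets a directory tree, rather than a single tarball, be pinned by one hash.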

dmjio commented 6 years ago

@CMCDragonkai I've constructed a fixed output derivation here for the libpostal address data:

   parserTarball = nixpkgs.fetchzip {
     url = "https://github.com/openvenues/libpostal/releases/download/v1.0.0/parser.tar.gz";
     sha256 = "193fk4x0j9jwvkcva5rir3zw8nhf994q40xyv59da6mlfxpi6w9q";
     stripRoot = false;
   };

along with another derivation to wrap that

   libpostalData = nixpkgs.stdenv.mkDerivation {
     name = "libpostal-data";
     buildCommand = ''
       mkdir -p $out/data
       ln -s ${parserTarball}/address_parser $out/data/address_parser
       ln -s ${addressBase}/address_parser/transliteration $out/data/transliteration
       ln -s ${addressBase}/address_parser/numex $out/data/numex
       ln -s ${addressBase}/address_parser/address_expansions $out/data/address_expansions
     '';
   };

and finally the derivation to build libpostal

   libpostalc =
     nixpkgs.stdenv.mkDerivation {
       name = "libpostal";
       src = nixpkgs.fetchFromGitHub {
         owner = "openvenues";
         repo = "libpostal";
         rev = "43795a3d903991d3864926393af10c3ec31a161c";
         sha256 = "0i6ij3rjlr4zdf3yz83yadnw8cbgswhmd3717nv0br93hahjyd16";
       };
       buildInputs = with nixpkgs; [ autoreconfHook curl ];
       configureFlags = [
         "--datadir=${libpostalData}/data"
         "--disable-data-download"
       ];
     };

I'm getting exceptions when I attempt to use it, though: the data directory can't be found, which is odd, since I'm specifying the directory as a configure flag (see above) and telling libpostal where the data lives.

  at libpostal_setup_datadir (libpostal.c:266) errno: No such file or directory
ERR   Could not find parser model file of known type
   at address_parser_load (address_parser.c:215) errno: No such file or directory
ERR   Error loading address parser module, dir=(null)
   at libpostal_setup_parser_datadir (libpostal.c:410) errno: No such file or directory
ERR   Error loading transliteration module, dir=(null)
   at libpostal_setup_datadir (libpostal.c:266) errno: No such file or directory

cc @7c6f434c @CMCDragonkai @albarrentine any thoughts?

7c6f434c commented 6 years ago

What even is addressBase?

dmjio commented 6 years ago

@7c6f434c

addressBase = nixpkgs.fetchzip {
  url = "https://github.com/openvenues/libpostal/releases/download/v1.0.0/libpostal_data.tar.gz";
  sha256 = "1hbckdqizhzznbsfgp5y2b8p074bw97kn766sfmkqmv18j98548n";
  stripRoot = false;
};

7c6f434c commented 6 years ago

I think you have an extra address_parser in your symlinks into addressBase.
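
For illustration, the wrapper with that extra path component removed might look like the sketch below. It assumes libpostal_data.tar.gz unpacks `transliteration`, `numex`, and `address_expansions` at its top level, which this thread suggests but does not confirm; the URLs and hashes are the ones dmjio posted.

```nix
{ stdenv, fetchzip }:

let
  parserTarball = fetchzip {
    url = "https://github.com/openvenues/libpostal/releases/download/v1.0.0/parser.tar.gz";
    sha256 = "193fk4x0j9jwvkcva5rir3zw8nhf994q40xyv59da6mlfxpi6w9q";
    stripRoot = false;
  };

  addressBase = fetchzip {
    url = "https://github.com/openvenues/libpostal/releases/download/v1.0.0/libpostal_data.tar.gz";
    sha256 = "1hbckdqizhzznbsfgp5y2b8p074bw97kn766sfmkqmv18j98548n";
    stripRoot = false;
  };
in
stdenv.mkDerivation {
  name = "libpostal-data";
  # Link each data directory from the archive that is assumed to contain it.
  buildCommand = ''
    mkdir -p $out/data
    ln -s ${parserTarball}/address_parser $out/data/address_parser
    ln -s ${addressBase}/transliteration $out/data/transliteration
    ln -s ${addressBase}/numex $out/data/numex
    ln -s ${addressBase}/address_expansions $out/data/address_expansions
  '';
}
```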

dmjio commented 6 years ago

@7c6f434c Figured it out: I wasn't using the right C function. As for the data, I ran ./libpostal_data download all datadir, tar'd the result, hosted it somewhere it won't change, and fetched it separately with Nix. Then I passed the path of the fetched data dir to the C function that takes a data dir.

CMCDragonkai commented 5 years ago

@dmjio Can you construct a PR for the work you have done and link it to this issue?

stale[bot] commented 4 years ago

Thank you for your contributions.

This has been automatically marked as stale because it has had no activity for 180 days.

If this is still important to you, we ask that you leave a comment below. Your comment can be as simple as "still important to me". This lets people see that at least one person still cares about this. Someone will have to do this at most twice a year if there is no other activity.

Here are suggestions that might help resolve this more quickly:

  1. Search for maintainers and people that previously touched the related code and @ mention them in a comment.
  2. Ask on the NixOS Discourse.
  3. Ask on the #nixos channel on irc.freenode.net.

MasseGuillaume commented 2 years ago

https://github.com/NixOS/nixpkgs/pull/179613