Interop between generated bindings

dcharkes commented 2 years ago

We should support interop between 2 or more jnigen-generated packages.

_Originally posted by @mahesh-hegde in https://github.com/dart-lang/native/issues/761

mahesh-hegde commented 2 years ago

Approach 1:

type_map a la ffigen

pseudo-config

type_map:
    'com.apache.pdfbox.PDDOcument': 'pdfbox'
    'android.os.Process': 'android_os'

import_libraries:
    'pdfbox': 'package:pdfbox/pdfbox.dart'
    'android_os': 'package:android/os.dart'

Approach 2: similar to above, but infer library names by package names.

import_map:
    'com.apache.pdfbox.PDDOcument': 'pdfbox'       // library imported as `pdfbox_`
    'android.os.Process': 'package:android/os.dart'   // library imported as `os_`

Advantage: simpler config Disadvantage: less control over import names in bindings, if that's an issue.

dcharkes commented 2 years ago

Cross linking the ffigen issue:

https://github.com/dart-lang/native/issues/539

cc @mannprerak2 any suggestions from thinking about this in the ffigen context?

mahesh-hegde commented 2 years ago

As detailed in dart-lang/native#521 , We don't have transitive dependency resolution at least now. Only specified packages are traversed.

So in the scenario, {A, B, C}, B and C import A.T, both B and C have to specify A.T using import map only.

But yeah, the fundamental issue persists even if the dependencies aren't implicitly included.

If b.yaml explicitly binds B.X, A.Y and c.yaml binds C.Z, A.Y, Y from both packages isn't the same.

In jni there's no structs, everything is a fragile, weakly typed pointer so you can cast. But for the fundamental issue, I think there's no solution.

In such cases, manually refactoring A into seperate jnigen config, and linking using some import map is best bet.

dcharkes commented 2 years ago

Interaction with output format

With a mapping from Java type to generated dart binding (both approaches), the API gets tied to whether we chose to generate single file or multi file. If the multi-file has approach still has a central file that exports everything, then it doesn't matter.

Config

I like approach 2 better. I don't think we care about the imported library name in the generated file.

How about we make the keys regexes or prefixes? That way we can match all types in a package:

# Map<QualifiedJavaType, DartLibraryUrl>
imports:
  'com.apache.pdfbox.*': 'package:pdfbox/pdfbox.dart'

dcharkes commented 2 years ago

In such cases, manually refactoring A into seperate jnigen config, and linking using some import map is best bet.

Agreed. That is the only way that this will work. (Same for ffigen.)

mahesh-hegde commented 2 years ago

I suppose regex is not required, if we specify

'com.apache.pdfbox': 'package:pdfbox/pdfbox.dart'

and jnigen needs to resolve com.apache.pdfbox.pdmodel.PDDocument

it should first check for com.apache.pdfbox.pdmodel.PDDocument, no specification, then for com.apache.pdfbox.pdmodel then for com.apache.pdfbox - match found.

Any pitfalls of this approach?

dcharkes commented 2 years ago

Yeah in the context of Java a prefix might make more sense than a regex. 👌

mannprerak2 commented 2 years ago

cc @mannprerak2 any suggestions from thinking about this in the ffigen context?

For ffigen, libclang provides USR (id), which are supposed to be consistent across translation unit (individual parsing). Maybe we can export a custom yaml/json file which can contain type-map for declarations with their USR for a parsed structures (Or maybe add annotations in the generated code itself) Any new library can simply specify this (In library-import config), and since the USR would be same, we can reuse the definition similar to how type-map works.

Not sure if this is useful for jnigen.

mahesh-hegde commented 2 years ago

@mannprerak2 It's same in Java actually. The fully qualified name is supposed to be same across all libraries loaded in single process.

Does ffigen have any mechanism to export USRs automatically, or is it manually specified through type map?

mannprerak2 commented 2 years ago

Does ffigen have any mechanism to export USRs automatically, or is it manually specified through type map?

Not yet, But it should be relatively easy to add to ffigen.

dcharkes commented 2 years ago

(note to self: USR = Unified Symbol Resolution)

We should design a format.

What information should be in there?

The USIs (Unique Symbol Identifiers)
The file in which the a USI is defined. Or should it be the Uri in the package? For that we need to define where in a package and in which package (package name).

If we have multiple USIs for one Uri, we should probably nest the USIs in Uris.

symbols.yaml

symbols:
  'package:pdfbox/pdfbox.dart':
    - 'com.apache.pdfbox.*'

or

symbols:
  'package:pdfbox/pdfbox.dart':
    - 'com.apache.pdfbox.pdmodel.PDDocument'
    # ...

For jnigen it might be nicer to just do

symbols:
  'com.apache.pdfbox' : 'package:pdfbox/pdfbox.dart'

But if we spell out all symbols

symbols:
  'com.apache.pdfbox.pdmodel.PDDocument' : 'package:pdfbox/pdfbox.dart'
  'com.apache.pdfbox.pdmodel.Foo' : 'package:pdfbox/pdfbox.dart'
  'com.apache.pdfbox.pdmodel.Bar' : 'package:pdfbox/pdfbox.dart'

because we're unlikely to have multiple regexes in a single .dart.

However, for ffigen, it is a bit more likely.

One question is: should we output an exact list of symbols, or a regex (or prefix)?

A prefix from the config in jnigen is probably fine, it is unlikely to collide with other names.

However, in the ffigen context, taking the method/class include filters is likely to not work. So there we should probably output the full list of symbols.

If we'd like to keep the ffigen and jnigen symbols.yaml output consistent with each other, it should be

symbols:
  # Map<DartPackageUri, List<UniqueSymbolIdentifer>>
  'package:pdfbox/pdfbox.dart':
    - 'com.apache.pdfbox.pdmodel.PDDocument'
    - 'com.apache.pdfbox.pdmodel.Foo'
    - 'com.apache.pdfbox.pdmodel.Bar'
    # ...

wdyt?

mahesh-hegde commented 2 years ago

In java, binaryName is unambiguous and uniquely maps to a class. FQN may be ambiguous in rare cases. In jnigen we use binaryName everywhere.

I am not familiar with USRs.

Are they serializable in some way?
Are they same across different invocations / parses of same set of files?
Are they same for a given symbol across different sets of files? For example, if you have {A.h, X.h} in one ffigen config1, {C.h, X.h} in config2, is the USR for symbol x from X same in both? Assuming same system of course.

I think it should also be possible to semi-manually work around this, by having the user provide a mapping which asserts X.h in config2 is equivalent to X.h in config1.

In the rest of this comment I am assuming we have a way to identify each binding symbol in string format, and there's some hierarchical structure. Fortunately this is almost always true in Java.

Re: YAML format

I think it's logical to have a mapping from required symbol to required import and not vice versa, because we lookup by symbol, and it's generated file && read by program, even if it looks tidier to have a tree with each import listing what it provides. Of course you can convert to different internal representation with a simple for loop. So matter of preference. Should this file be human readable? Does this file's size matter?

Re: lookup

Assuming a hierarchical structure among bindings, i.e given an identifier you can tell it's parent (eg java package or header file), we can easily implement lookup in most-specific to most-general order. (As I gave an example above).

a.b -> 'package:a/b.dart';

resolve(a.b.C) will look for a.b.C first, then a.b, and matches the above spec.

Assuming b encompasses all subpackages as well, a.b.c.D will match the above import spec. Maybe if we don't want this behavior, we can use a '/' or other symbol to explicitly specify that 'a.b' is recursively encompassing, and change the lookup code accordingly. That's not a problem.

Now, assuming some classes were filtered out using config, we need to explicitly write the filtered out class with null mapping, so an explicitly skipped class a.b.X won't match the above spec. Code should stop at seeing null.

There are more implementation details to still flesh out, but I think the basic idea is sufficient.

I think regular expressions may be harder to do constant time lookup, unless maybe you sort the list of all regexes and binary search the ones closest to what you are looking for. with prefixes there will be a single digit number of combinations you have to check in a map. Is there any specific advantage of using regex in ffigen context?

But yeah above all is leetcode. If it's only a machine read-write file I don't mind writing out every type either, it will still be smaller than generated binding file. And reading some JSON/YAML should be pretty fast compared to parsing C/Java.

https://en.wikipedia.org/wiki/List_of_unsolved_problems_in_computer_science

mannprerak2 commented 2 years ago

Are they serializable in some way? Are they same across different invocations / parses of same set of files? Are they same for a given symbol across different sets of files? Assuming same system of course.

Yes to all.

mannprerak2 commented 2 years ago

Is there any specific advantage of using regex in ffigen context?

C libraries usually add a prefix to all the symbols in a library so as to prevent name collision. (E.g libclang uses clang_ for functions and CX for rest. So a regexp is really useful here to quickly target all symbols of a library .

Edit: If you meant just using it in symbol.yaml context, then I don't think regexp is of much use.

dcharkes commented 2 years ago

Re: Re: YAML format

Should this file be human readable?

It makes it easier for people to understand the whole setup. Also, people will see it in diffs when committing and in PRs etc.

Does this file's size matter?

Likely not much, I presume that everything (including git) compresses files. And repeated strings are relatively easy to compress. However, more text also means it is hard for users to read through if trouble shooting is needed. So smaller is usually better.

Of course you can convert to different internal representation with a simple for loop.

Exactly. So I prefer optimizing for simple for humans to understand in diffs etc.

tidier to have a tree

That would be slightly tidier than a map indeed.

Something like

symbols:
  'package:pdfbox/pdfbox.dart':
    'com.apache.pdfbox.pdmodel' :
      'PDDocument' : true #  equals 'com.apache.pdfbox.pdmodel.PDDocument' : true
      'Foo' : true
      'Bar' : true
      'nested' :
         'Baz' : true
       'nested2' : true
    # ...

But that requires organizing all the symbols that jnigen binds to into a tree of common-prefixes. I'm not sure if that is worth it over a flat list:

symbols:
  'package:pdfbox/pdfbox.dart':
    'com.apache.pdfbox.pdmodel.PDDocument' : true
    'com.apache.pdfbox.pdmodel.Foo' : true
    'com.apache.pdfbox.pdmodel.Bar' : true
    # ...

We'd like to have a map (or set) in the Dart runtime for quick lookup, but in yaml it just creates noise. So probably the list that I suggested is nicer.

symbols:
  'package:pdfbox/pdfbox.dart':
    - 'com.apache.pdfbox.pdmodel.PDDocument'
    - 'com.apache.pdfbox.pdmodel.Foo'
    - 'com.apache.pdfbox.pdmodel.Bar'
    # ...

dcharkes commented 2 years ago

Now, assuming some classes were filtered out using config, we need to explicitly write the filtered out class with null mapping, so an explicitly skipped class a.b.X won't match the above spec. Code should stop at seeing null.

This is a reason to not use prefixes at all in our yaml file., but only generate mappings for the full path.

dcharkes commented 2 years ago

Based on the discussions for ffigen:

We need to version the symbols file, so that we detect if it was written by a too new or too old version of jnigen.
If we have any config options that influence the if generated bindings should import each other (other than the file paths) or make it incompatible to import each other, those should be listed in the yaml as well.

format_version: 1.0.0
files:
  "../generated/base_gen.dart" :
    used_config:
      <some-symbol-that-infuences-importing-bindings>: <some-value>
    symbols:
      'package:pdfbox/pdfbox.dart':
        - 'com.apache.pdfbox.pdmodel.PDDocument'
        - 'com.apache.pdfbox.pdmodel.Foo'
        - 'com.apache.pdfbox.pdmodel.Bar'

Note that compared to https://github.com/dart-lang/native/issues/539, we don't have to include a name per unique symbol, because the last part after the last . is the symbol. (If that turns out to not be true, we would need to do the same as the ffigen config and make the unique symbols a map with a name member instead of a list.)

mahesh-hegde commented 2 years ago

If that turns out to not be true

Yes. symbols can be renamed, both for conflict resolution and probably in future, manually. name = mapping["name"] ?? getElementName(binaryName), not a problem in practice I think.

mahesh-hegde commented 2 years ago

Few notes / questions:

Depending on configuration (multifile vs single-file), we might want a more general way of specifying class names

format_version: 1.0.0
files:
"lib/src/" :
used_config:
  output_structure: package_structure
symbols:
    - 'com.apache.pdfbox.pdmodel.PDDocument'
    - 'com.apache.pdfbox.pdmodel.Foo'
    - 'com.apache.pdfbox.pdmodel.Bar'

So this means the above classes are in lib/src/ under java-esque folder tree, and imports should be resolved as such.

I am still catching upto discussions on ffigen. I may be missing few points from the discussion.

This also raises a question of lots of small imports vs few large imports (eg per package _package.dart files).

dcharkes commented 2 years ago

Okay, so we need name as well.

(Oops, I copy pasted too much. We had two file paths in there.)

I believe output_structure is compatible.

format_version: 1.0.0
files:
  'package:pdfbox/pdfbox.dart': # Everything exported from some lib/src/generated/_all.dart
    used_config: # Optional.
      # Probably empty for now.
    symbols:
      'com.apache.pdfbox.pdmodel.PDDocument':
        name : 'PDDocument'
      'com.apache.pdfbox.pdmodel.Foo':
        name : 'Foo'
      'com.apache.pdfbox.pdmodel.Bar':
        name : 'BarRenamed'

format_version: 1.0.0
files:
  'package:pdfbox/com/apache/pdfbox/pdmodel/PDDocument.dart': # Directly generated into the public API of the package.
    symbols:
      'com.apache.pdfbox.pdmodel.PDDocument':
        name : 'PDDocument'
  'package:pdfbox/com/apache/pdfbox/pdmodel/Foo.dart':
    symbols:
      'com.apache.pdfbox.pdmodel.Foo':
        name : 'Foo'
  'package:pdfbox/com/apache/pdfbox/pdmodel/Bar.dart':
    symbols:
      'com.apache.pdfbox.pdmodel.Bar':
        name : 'BarRenamed'

Both of these would work when being imported from another binding generation.

Depending on configuration (multifile vs single-file), we might want a more general way of specifying class names

We don't have to, we can keep it simple, just list the full path of all files in the yaml as I did above.

HosseinYousefi commented 1 year ago

I'll try to specify everything here before implementing the feature.

Generating the export file

Yaml path

The default path will be a top-level jnigen_export.yaml. This can be changed in output > export > path:

output:
  export:
    path: a_export.yaml

Excluding classes to export

This can be specified in export > exclude:

export:
  exclude:
    - com.example.A
    - com.example.B

The export file's structure

version: 1.0.0
files:
  'package:x/y.dart':
    classes:
      'com.example.A':
        name: A
        type_class: $AType
        super_count: 2
        type_params:
          T: 
             'java.lang.Object': DECLARED
        methods: 
          abc: 2 # in this case the method got the 2 suffix, so when extending this class, we need to make sure the corresponding method that overrides this, get the 2 suffix

super_count can be removed by me in a PR, in that case it's also going to be removed from the yaml. fields and method renames can be added in future versions.

Import

This can be specified in import. All of the ones below are supported:

import:
  - 'package:x' # will look for the default top-level jnigen_export.yaml
  - 'package:y/y_export.yaml'
  - 'path/to/z_export.yaml'

If the same class comes in a subsequent import, that one will be overriden with a warning log.

import is going to have package:jni at the beginning of it by default.

If the same class is generated by the library itself, that takes precedence over the imported ones.

Granular control

package:jni has multiple classes such as List what if we want to make sure that we have that one even if some other import overrides it?:

import:
  - 'package:java_collections' # even though this has a class for java.util.List
  - 'package:jni':  # this takes precedence
     show: 'java.util.List'

Or

import:
  - 'package:java_collections': # You can hide individual ones
    hide: 'java.util.List'

Am I missing something? Different structure or naming suggestions?

dcharkes commented 1 year ago

If the same class is generated by the library itself, that takes precedence over the imported ones.

That does mean that if the imported library has methods that take that class as an argument, they will still use the definition of that library. So we end up having to do casts on that boundary.

Maybe it's better to have an error, and having to explicitly hide. Then having to do casts because of duplicate definitions does not come as a surprise to users.

Different structure or naming suggestions?

We use "symbol-file" as the name for the yaml file in FFIgen. And symbol-file in both the import and export config. Should we consider taking that naming here as well so that we have a way to talk about the file. (It's called symbols because that's what we call the things you can look up in dlls, not sure if we should have a slightly different name in Java/Kotlin land.) Having proper name for the file instead of calling it foo_export.yaml. What is the "thing" or "things" that we import/export?

Also, why have the package name in there? That is superfluous.

import:
  - 'package:x' # will look for the default top-level jnigen_export.yaml

Oh I like! (Though we could consider not doing that and just doing the non-magic thing. E.g.analysis_options.yaml does include: package:lints/recommended.yaml. We keep the ecosystem consistent. It's not that much more typing and is crystal clear.)

classes:

Are there any other things than classes that we export/import?

(Just a large config for my own sake to see where imports fit in.)

preamble: |
  // ..

output:
  c:
    path: 'src/'
    subdir: 'third_party/'
    library_name: 'foo'
  dart:
    path: 'lib/src/third_party/'
  symbols:
    path: 'lib/symbols.yaml' # So that it gets a library uri.

classes:
  - 'org.example.foo.Foo'

maven_downloads:
  source_deps:
    - 'org.example....'

import:
  - 'package:java_collections/symbols.yaml' # Using a library uri.

Again, 'symbols' might not be the right term to describe this. But 'import' and 'export' describes the action with the "thing" not the "thing", not the contents of that yaml file.

If we have nothing else than 'classes', maybe it should be classes.yaml? Or jnigen_definitions.yaml? The thing describes the structure of the generated Dart code, toplevel 'things' in Dart are 'definitions'.

HosseinYousefi commented 1 year ago

Maybe it's better to have an error, and having to explicitly hide.

Good idea.

Though we could consider not doing that and just doing the non-magic thing.

Agreed, I think that's better. No default and having to explicitly naming it is good.

Are there any other things than classes that we export/import?

classes key is there so that if we want to add some option to a single file package:x/y.dart, we can. I don't exactly know if we need this, so maybe we can remove classes and bump up the version in case we needed it in the future.

Tangential: I think it's also useful to have some way of breaking down the generated "symbols" files into multiple ones, maybe adding an option to add sub-exports per generated library. So if a library generates the entire java stdlib, we could import only `package:java/util/util_symbols.yaml'. Also including multiple symbols files into one.

Not necessary for closing this issue but could be useful in the future.

dcharkes commented 1 year ago

Tangential: I think it's also useful to have some way of breaking down the generated "symbols" files into multiple ones, maybe adding an option to add sub-exports per generated library. So if a library generates the entire java stdlib, we could import only `package:java/util/util_symbols.yaml'. Also including multiple symbols files into one.

That's a good idea.

But indeed let's just build the simple thing first, and bump the version if we want to cover subsequent use cases.

mahesh-hegde commented 1 year ago

files: 'package:x/y.dart':

If this is relative path, you don't have to take package name x as a config option. Is there an argument in favor of including package name?

I have been inclined against package name in config, although it might have made few things easier, because jnigen should work from anywhere in hierarchy, then you will be also adding the relative path to config from package root. (Correct me if I missed something).

show: 'java.util.List'

Nice!

HosseinYousefi commented 6 months ago

It's better to change the format to json instead of yaml. The argument is the same as with native_assets, the data is going to be produced and consumed by tools not humans, so yaml is unnecessary.

dcharkes commented 6 months ago

My glorious infestation of yaml is coming to an end...

dart-lang / native