Closed chriseth closed 4 years ago
This could either be an additional source in the json-io, or it could be a field in the output part alongside evm
, ir
and so on - maybe called internalRoutines
. We should allow for more than a single routine and those should get identifiers such that they can be accessed from source maps.
We should think about also supporting yul codegen in the future - source references should be able to reference yul ir.
@yann300 @LianaHus @haltman-at we are planning to include the yul sources of internal routines (mostly the ABI encoder) in the compiler output with the goal that source references to these routines work in the bytecode. This would allow stepping through the ABI encoder in the yul source code.
The current best idea is to add another output field in the json-io called something like ```
"internalRoutines": [ {"id": <sourceID>, "content": <yul code as string> }]
Do you have any comments on that? The source id would be just another id referenced from the source references (for most of the source references that reference id "-1
"), but a different than those used for regular solidity sources.
OK. Sounds workable. As I said earlier, my main concern is just that the IDs should be distinguishable from those of regular Solidity sources (e.g. always being negative or something).
@haltman-at "always negative" turns out to be a little difficult. Do you need those sources to be distinguishable from their ID? It will be clear that they are not Solidity from the way they are returned from the compiler. I will have a draft soon.
OK, directly from the ID would be nice but I guess I'll see just how they're returned.
Note when I say "not Solidity" I really mean "not user input" (i.e. not user input Yul either).
I guess my real concern here is that Truffle Debugger often runs off of Truffle artifacts rather than compiler output directly, and it's not obvious to me whether this is something that can be incorporated into Truffle artifacts (probably it can be, we can always add things, but the format does have some real limitations :-/ ). So for that reason it'd be nice if we could still continue to distinguish user code from internal code even should we not have access to this information (robustness! :) ), just like currently we do this by checking for -1
.
Worst case we could instead check, do we have user source for this id; that's not the same, obviously (given that source could be missing), but it should probably be good enough. The cases where this really matters are pretty narrow tbh so we can probably figure out something workable regardless.
The actual case where this most matters is for keeping track of mapping keys when the expression used for the key is a calldata array access or calldata struct member access. As in, calldata uint[] arr
, mapping(uint => uint) map
, map[arr[0]] = 1
, we need to determine arr[0]
and associate it to map
. Distinguishing user source from internal source comes up here when attempting to locate the value of arr[0]
on the stack, since we use the distinction to look ahead to the correct trace step by skipping over internal source. Pretty niche, I know! And probably not hard to preserve, given that nothing too funny is going to happen while computing a calldata array access, so there are probably multiple ways to achieve that. But that's actually the reason I care, FWIW!
The index will be stored as a string, so it takes as many characters to store -1
as it does to store g1
(g for "generated") and it would be more descriptive. Would non-integers be also OK here?
If we go for strings, I'd make it something proper like internal-1
, internal-2
, ...
And this would only affect parsing the compiler output, because users can always just translate internal-*
to -1
or internal-<n>
to -<n>
themselves to use in their further internal representation, can't they?
This "index" is used in the source reference string which should be rather small, so I would prefer a two-character index over some longer string.
Sure, non-integer works. As long as it can be distinguished without further information that would be nice. Thank you!
Actually I think non-integer is problematic, since we already use integers for source ids in various json outputs.
@haltman-at @yann300 @LianaHus @alcuadrado @iamdefinitelyahuman do you want to try out this build: https://396213-40892817-gh.circle-artifacts.com/0/soljson.js
It adds "generatedSources" to the bytecode output part of standard-json as shown in this example: https://github.com/ethereum/solidity/pull/9053/files#diff-57268994c6aef50f315c432ecd7096de
The source references will point to this file (with the source index referencing the id) and it should allow you to show the actual yul source code of the internal routines during debugging. Would really love to get your feedback on this one!
Thanks for keeping us updated about this. We'll take a look at this soon. When will this be released?
Also, I'm tagging @fvictorio here, who also works on Buidler now.
We are waiting for some feedback about whether the structure is useful or if we should change the IDs.
Interesting, thanks! I'm a bit confused on how the IDs work though. So these things will just get ordinary, whole-number IDs? There won't be any way to tell them apart from other sources by ID alone, if we don't have this info?
I mean, I can work with that, but it is something I will have to account for...
Edit: Also, I'm guessing ID -1 will still be used for those things that are still neither Solidity nor Yul?
@haltman-at yes, that's mostly what I wanted to get feedback on. I can make the IDs negative, but then we don't have any more room for other categories. I can turn them into strings which would allow for more prefixes. I can keep them integers in the AST for proper sources but use strings for the sources of internal routines. Note that the source mappings are just strings themselves, so you will not be able to tell the difference, but you have to take care when interpreting the components of the source mappings.
Hey @chriseth, I ran all buidler tests with that compiler and I got an error with one of the test cases. This is the code:
The error is:
Error: Failed to compile: Internal exception in StandardCompiler::compile: /solidity/libsolidity/interface/CompilerStack.cpp(592): Throw in function auto solidity::frontend::CompilerStack::generatedSources(const std::__2::string &, bool)::(anonymous class)::operator()() const
Dynamic exception type: boost::wrapexcept<solidity::langutil::InternalCompilerError>
std::exception::what:
[solidity::util::tag_comment*] =
@haltman-at yes, that's mostly what I wanted to get feedback on. I can make the IDs negative, but then we don't have any more room for other categories. I can turn them into strings which would allow for more prefixes. I can keep them integers in the AST for proper sources but use strings for the sources of internal routines. Note that the source mappings are just strings themselves, so you will not be able to tell the difference, but you have to take care when interpreting the components of the source mappings.
Yes, it would be ideal if there were some way to tell the difference even without having the generated source info; it doesn't have to be negatives. Strings should be fine as long as they don't use colons or seimicolons, right?
Of course, any way of doing things will break our present code I'm pretty sure (can't run the tests at the moment for technical reasons, sorry, should be fixed shortly), but, I guess if this is happening I'm just going to have to account for that.
Also, I notice the generatedSources output doesn't include ASTs for the generated sources. Would it be possible to get those by any chance? Thank you!
They would have to be yul asts if that's fine.
@fvictorio I think that is because of the interface. I'll fix it next week.
They would have to be yul asts if that's fine.
Yes, of course; Truffle Debugger can handle those. It might need some minor adjustment but that should be easy enough.
I took a first look at the output, and it looks really promising.
I'm not sure if I share/understand the concern about the IDs, but haven't thought enough about it. As Buidler will have to be adapted for this, anything is clear and let's use distinguish user and autogenerated files is fine.
Also, I notice the generatedSources output doesn't include ASTs for the generated sources. Would it be possible to get those by any chance? Thank you!
They would have to be yul asts if that's fine.
This would be very beneficial!
@chriseth how does the id of generated sources work? For example, if there are two files in sources
with ids 0 and 1, and there are two contracts with generated sources, what will those ids be? 2 and 3? 2 and 2?
Put another way, will the ids be unique for all the compilation output?
Oh, disregard my comment, that's exactly what you were discussing before.
I think being able to tell apart between normal and "internal" ids would be useful, but not strictly necessary because you can infer it from the output (assuming they are indeed unique). In case they will be differentiated, my thoughts are:
i1
, i2
, etc) seems like a good compromise. This will break everything that assumes ids are numbers (like buidler), but it shouldn't be that hard to adapt.I think the question in my previous comment still holds, though. If, for example, ids like i1
, i2
are used, how should they be scoped? Different bytecode entries can have different sources with id i1
? My gut feeling is that they should be globally unique, for simplicity and maybe for hoisting common internal sources (I don't know if this makes sense though).
To chime in here to support @haltman-at's point, I'm strongly against clobbering the non-negative numbers namespace with internal sources. It seems like the argument of "we must reserve the negatives for future internal use!" doesn't hold weight if you're comparing it to the alternative of clobbering an existing, user-facing namespace.
Out of UX concerns, there's definitely good reason to make it easy to identify what's user-defined and what isn't... Truffle does this in several places, and even though (to @fvictorio's point) it's not "strictly necessary because you can infer it", I'd really prefer to avoid the need to do this kind of inference ourselves.
Using a single letter prefix is fine for Truffle's cases as well: we can handle that with minimal disruption in a subsequent release. I'd support that solution if that's the path forward.
Just my Ξ0.002!
Hm, just tried this out and was surprised to find I couldn't get it to cause the problems I was expecting it to. That said, just because it didn't doesn't mean that it can't. But, the more I think about it, the less sure I am about my earlier claim that these sources should have IDs that are somehow distinguished... I can probably handle things in a way that doesn't rely on that?
Also, going negative or non-numeric would cause other problems I didn't think about earlier, so maybe this way is fine...
So maybe to reiterate: The source IDs are mainly important for the source references. If you process source references, you see a source ID. You do a lookup of that source ID to see what source file it refers to. If the source is compiler-generated, then its ID will just not appear in the list of source files known to you. So I don't think it is a big deal to distinguish these. Note that we actually also had these source references to internal routines in previous versions of Solidity - they were -1
and everyone should have been able to process this "ID not in list" case already.
Here is another binary: https://github.com/ethereum/solidity/pull/9053#issuecomment-676597122 It would be great if everyone could take a look!
Here is another binary: #9053 (comment)
I just add a chance to try it out, it works as expected in Remix. From what I see, this is fine for us to have non negative number for generated sources. The only reason I see for having non-numeric value to easily distinguish between user defined code and generated code, is that if the generated code is not available (which I guess is unlikely to happen), dev tools can report back to the user that "the current bytecode being debugged comes from a generated source, but that the source cannot be displayed".
Well I don't think that's really nice to have in the source maps non numeric values and numeric values for the same slot... Maybe the source maps can be extended with a new property to reflect from where the byte code is generated (user code, yul generated code, sub routine?, ...).
Here is another binary: #9053 (comment)
Buidler tests pass with this binary.
@yann300 I don't think passing this information to the part that generates the source maps is that easy. Furthermore, strictly, it is redundant information, because it is there in the other artefacts.
Was anyone able to work with the source references to the generated sources? is anyone able to plug this into their debugger easily so we can step into the generated sources?
Do you want this to be merged into the next release even if it is not properly tested yet?
Was anyone able to work with the source references to the generated sources? is anyone able to plug this into their debugger easily so we can step into the generated sources?
We could try to do something, will let you know soon
is anyone able to plug this into their debugger easily so we can step into the generated sources
You can test it out with: https://ipfsgw.komputing.org/ipfs/QmRfe32WNq8yZ6TvLDiKPtnJDZbVxC7woGL1mQ4Kfnesis (need to manually load soljson though)
@yann300 impressive, good job! The highlights are correct, but I noticed the following things - can you check if this is a bug in the compiler or in remix (or a missing feature):
Another thing I noticed is that the compiler always outputs the non-optimized Yul code, even if the optimizer is activated. This means that the source references sometimes are confusing and the control flow will "jump around". Would it be better to output the optimized yul code if the optimizer is activated? Note that one of the goals of the optimizer is to generated code that is still readable, so it is still very much different from an opcode stream.
If you process source references, you see a source ID. You do a lookup of that source ID to see what source file it refers to. If the source is compiler-generated, then its ID will just not appear in the list of source files known to you. So I don't think it is a big deal to distinguish these.
@chriseth could you provide a bit more information on how the compiler will output sources? I am strongly opposed to reusing the positive numbers namespace for internal sources because I estimate that the work on Truffle's side to support this model change could take up to a month of work+discussion.
@gnidan if that's important for you, I can check again if the i0
scheme would also work.
@chriseth I'll provide a bit more context on where I'm coming from:
We've been working on modeling compilation outputs, and we have a bunch of code + internal knowledge around the idea that source indexes can be modeled as a sparsely-populated list, where non-null indexes of that list correspond to user-defined sources. There's already a bunch of code that leans on this assumption - getting the team on board to revise our existing model and associated code is not something I look forward to!
Of course, if it's not possible on Solidity's side (or if the i0
approach negatively impacts other tools more), I'm happy to concede to whatever approach has the most utility / least cost for the most people!
(Thank you for your consideration!)
I think I should clarify here -- discussing this with @gnidan (please correct me if I'm summing this up wrong), I think whether positive numbers are OK depends on the exact form of the compiler output. If the generated sources are just completely mixed in with user sources, it would be a problem. If however they're in a separate area you have to explicitly inspect, I think there's a good chance they're fine. But again it depends on the details.
(I was on vacation last week and so haven't yet had a chance to try this out myself, I'm afraid! But I'll see what it looks like shortly, I guess.)
Personally I'm wondering, if you do take the whole-number approach, if it might be possible to have generated sources have source IDs at the end, after all the other sources... but my reason for wanting that might actually not be valid, I need to actually try this out and see the details...
The linked binary outputs the generated sources at a completely different place (the input sources are not output anyway...) and also the Yul AST is at a different place than the Solidity AST if requested. The input source IDs are contiguous, followed by the generated source IDs.
@haltman-at here is an example input (followed by the output): https://github.com/ethereum/solidity/pull/9053/files#diff-3fb8608d3a06316322aec56512c86095
Oh, wonderful! Again I'll have to try it out but I think that should be pretty workable with what Truffle does! (@gnidan can hopefully correct me if I'm wrong here).
Ooh nice! Yeah, after looking at this and discussing with @haltman-at, I think we can handle this without the significant impact I feared. I rescind my prior strong objection!
So how does the pendulum swing? Anyone rather for something like i1
? Anyone rather for just higher regular non-negative numbers?
If nobody says anything, my laziness kicks in and we stay with non-negative numbers :)
Do we plan to also provide a map from the internal routines to the solidity source location or source locations (if any) that led to it being generated?
Source references already map to internal code - we should provide access to this code, at least the abi encoder / decoder written in yul. This does not need to be part of the metadata because it can be re-generated from the source.