MatthieuDartiailh / bytecode

Python module to modify bytecode
https://bytecode.readthedocs.io/
MIT License
302 stars 38 forks source link

Is there any way to get the bytecode offsets along with the line number ? #48

Open azadsalam opened 5 years ago

azadsalam commented 5 years ago

x = 0

Consider the above code snippet. The dis module outputs the line numbers along with the bytecode offsets.

  1           0 LOAD_CONST               0 (0)
              2 STORE_NAME               0 (x)
              4 LOAD_CONST               1 (None)
              6 RETURN_VALUE

Using the bytecode module I get the following, except the offset=<> part. Is there any way I can get these offsets, similar to the dis module's output?

<LOAD_CONST arg=0 lineno=1 offset=0>
<STORE_NAME arg='x' lineno=1 offset=2>
<LOAD_CONST arg=None lineno=1 offset=4>
<RETURN_VALUE lineno=1 offset=6>>

I modified the bytecode module source to get the above output. If there is no other way except modifying the source, should I submit a PR?

MatthieuDartiailh commented 5 years ago

No currently the instruction do not store their offset so it is lost. What is your use case for the offset ?

azadsalam commented 5 years ago

No currently the instructions do not store their offset so it is lost. What is your use case for the offset?

I am instrumenting python programs and offsets will allow me to locate exactly which bytecode my instrumentations correspond too.

MatthieuDartiailh commented 5 years ago

What concerns me is that as soon as you are going to modify the bytecode the offset won't be valid anymore, which means that the offsets would have to be dynamically computed. However the instruction which is the logical level to access this information does not hold a a reference to the surrounding bytecode that can compute the offset.

azadsalam commented 5 years ago

You are completely right. However, for my use case, I do need the Actual offsets, i.e., the offsets before I did any instrumentation.

MatthieuDartiailh commented 5 years ago

In that case could you use dis to get the offset ? and do something like:

frozen_b = dis.Bytecode(f)
editable_b = bytecode.bytecode(f)
for frozen_i, editable_i in zip(frozen_b, editable_b):
     # do whatever that requires the offset
azadsalam commented 5 years ago

Yes, that can be done, of course! However, what I am trying to understand is, is there any other downsides of having the original offsets as fields along with the Instr objects itself, except that if someone modifies the Instr list of Bytecode objects, the offsets for the modified code will have to be dynamically calculated and may differ from the original offsets.

From my point of view, having some way to have the original offsets within the Instr class or, in an auxiliary data structure (maybe only readable) will enhance the library and will be equivalent to the dis module, information wise.

MatthieuDartiailh commented 5 years ago

If we expose such an attribute I am mostly worried that people may complain about the fact that this attribute becomes out of sync. But I may be over thinking it. I need to look again at how we handle lineno and document it (which suffers from the same kind of issues), but it may be fine. If you want to start a PR feel free to do so. I may be a bit slow to answer but I will be sure to take a look.

azadsalam commented 5 years ago

Thank you so much, for your prompt feedback. I get your point, but as this module is written for modifying bytecodes and it is expected that once you modify the bytecodes the original offsets might not hold.

MatthieuDartiailh commented 5 years ago

Yes but we have to make it very clear in the documentation. Furthermore, we could provide methods on the bytecode to update the offset (and lineno) to the value they will take in the synthesized code object.

azadsalam commented 5 years ago

I will try to come up with the PR soon.

SnoopyLane commented 5 years ago

I have mixed feelings on this. When I was trying to address the problem is branch offsets expanding to need an additional byte, I found myself wishing many times for a code offset on the instructions.

But here's the thing: Those instructions are logical concepts and the concept does not have an intrinsic size from which you can deduce the offsets. I would use Unicode as an example. There are codepoints with values, but the size of those values depends on whether you encode as 32-bit wchar_t, or Windows UNICODE (ie UCS-2) or UTF-8, or HTML " " (6 characters); yet the codepoint itself is size-less.

I think it's completely reasonable for ConcreteInstr to have a size since it's concrete. But Instr is a logical concept: it has no size, and from there I'm not sure how it can have an offset given that it has no size.

Maybe some helper functions that would make it easier to get this information, because it can be tedious to get back to a ConcreteInstr once you've been manipulating a CFG. I would be very concerned about the information becoming wrong, though: once any instruction is changed then the offsets of anything can change, too. I would think ConcreteInstr.offset() ought to be a method so that it can embed whatever complexity is necessary to deal with that.

azadsalam commented 5 years ago

Good points. Can you/ @MatthieuDartiailh kindly provide your thoughts on the following points:

1) returning an auxiliary data structure when Bytecode is generated using the Bytecode.from_code method, mapping Instrs to the "original" offsets ( i.e., the offsets in the code object it is generated from)?

2) having a method in ConcreteBytecode, which returns the Current mapping from Instrs to offsets?

MatthieuDartiailh commented 5 years ago

I agree with @SnoopyLane that offset makes sense only on ConcreteBytecode and as already mentioned I share his concern about this getting very easily out of scope. (As I mentioned I feel that lineno suffers from mostly the same problem, Instr has a lineno but lineno is actually determined by the SetLineno meta instruction and I am wondering if lineno should not be limited to ConcreteInstr). To me it seems that having ConcreteBytecode (and ConcreteInstr) being un-mutable would solve most of those concerns. We could then put as many details as we want in the structure without worrying about it getting out of sync and people would simply have to convert to Bytecode to perform edition. However this would be an API breaking change. Any opinion on this ? @vstinner @serhiy-storchaka do you have an opinion on this question ?

SnoopyLane commented 5 years ago

Thinking out loud here. This is not a proposal.

I'm trying to imagine how I could use this, and just re-read some of @azadsalam's comments.

If I was trying to instrument bytecodes, I can imagine having an Instr and wanting to know "Where did this come from?". Maybe if an Instr contained a reference to the ConcreteInstr that it was disassembled from? And that ConcreteInstr had an offset?

That would open it's own can of worms though. Once that Instr was assembled back to a ConcreteInstr would you want to know the original ConcreteInstr that it came from? Or the new ConcreteInstr that it was assembled into? Or both?

Somebody slap me if I'm over-thinking this.

I'm not sure what the above could do for @MatthieuDartiailh's point about lineno? I kind of like the idea of Instr not having a lineno and you get it from "Instr.original_concreteinstr.lineno". But lineno is something that we want to carry along when we re-assemble. I feel like there's probably a clean answer for this and yet I'm nowhere close.

As for my own use case for offset in the past: Only as a python dev writing optimization code that uses the Bytecode API (ala Hettinger's make_constants). At a PDB prompt I've stepped through code that removes/adds/changes an Instr, and I want to look at something like a "dis" dump to get a warm fuzzy feeling that my code does what I think. And step through more changes, dump it again. And again. And again. In the past I've done that by writing a little helper method that just compiles to Concrete and dumps to stdout, and I just call that method from the PDB prompt. Really low tech and old school, and I don't even have an example to post because it looks like the helper never made it to a commit.

MatthieuDartiailh commented 5 years ago

That does seem extra complicate to me. If a user do need to keep that kind of information I would say he should convert the code object in whatever we design that hold that information. Then convert it to Bytecode. He can later convert the Bytecode to a new low-level representation if he wants to look at the offset. It feels more and more to me that we just need to expose a way to convert to/from the representation existing in the dis module to avoid duplicating code.

Regarding lineno, what bothers me currently is that the information is duplicated between the Instr, and SetLineno. Note that SetLineno is also permitted in ConcreteBytecode. I feel it would be cleaner to:

MatthieuDartiailh commented 5 years ago

ping @SnoopyLane @azadsalam

Any comment ?

laike9m commented 5 years ago

Thanks @MatthieuDartiailh for pointing me to this issue. I'm also interested in having a way to get offset easily, cause right now I have to iterate over Bytecode to calculate the offset, which is inefficient.

I read the above comments, from my understanding, the concern is mostly about whether it is appropriate to coupe [Concrete]Bytecode with the original bytecode. I don't have enough understanding of the library to give a perfect answer, but perhaps we can consider a different approach.

So, we are concerned about having things coupled, how about create a separete class specifically for handling mapping, and leave [Concrete]Bytecode unchanged.

For example, a class called BytecodeMapper

mapper = BytecodeMapper(my_func.f_code)
bc: Bytecode = mapper.get_bytecode()
cbc: ConcreteBytecode = mapper.get_concrete_bytecode()

# For offset
mapper.find_instr_by_offset(int)  # returns a Instr
mapper.find_concrete_instr_by_offset(int)  # returns a ConcreteInstr
mapper.get_offset(Union[Instr, ConcreteInstr])

# For lineno
mapper.find_instrs_by_lineno(int)  # returns a list of Instr
mapper.find_concrete_instrs_by_lineno(int)  # returns a list of ConcreteInstr
mapper.get_lineno(Union[Instr, ConcreteInstr])

You can do whatever you want with the returned Bytecode object. When get_offset or get_lineno is called, it checks whether the passed in instr actually comes from this mapper(aka is, not ==). If not, just return None.

MatthieuDartiailh commented 5 years ago

The problem I see with your suggested approach @laike9m is that there is now way to get the new offsets. I still fill that freezing ConcreteBytecode would provide some nice insurance regarding lineno and offsets. Does any of you perform manipulation on ConcreteBytecode instead of Bytecode or ControlFlowGraph ?

laike9m commented 5 years ago

@MatthieuDartiailh By "new offsets" you mean the offset after modifying Bytecode or ConcreteBytecode right? I would suggest not bother doing it, it feels hacky and may have problems? I can not say for other people, maybe someone need the offset to be updated as well, but I just don't feel this to be a legitimate use.

azadsalam commented 5 years ago

I do not use ConcereteBytecode and I think freezing the "source"/"original" offsets (the original offsets in the input code object) is a good idea.

For the offsets of a Bytecode after some manipulation, the interface should only allow getting "new" offsets via ConcereteBytecode.

Sorry for replying late.

MatthieuDartiailh commented 5 years ago

After reading the documentation again, I realized that the presence of lineno on Instr was by design with the SetLineno meta instruction being just a convenience so I am coming back on my suggestion of removing lineno from Instr. However I may consider a helper function to remove those meta instructions by updating the lineno.

This makes me actually more inclined towards @laike9m idea of having a mapper. To handle the case of looking at offset after a manipulation we could then simply allow to create a mapper from an existing Bytecode (basically through assembly and disassembly). Is anybody interested in trying to implement those ideas ?

SnoopyLane commented 5 years ago

Sorry for the late response on this. I wouldn't care if ConcreteBytecode were frozen. If anything it seems like a good thing to me.

Seems to me that for people that want a greater amount of detail than what dis provides, that ConcreteBytecode is a solution for that. But if you're trying to manipulate bytecodes then ConcreteBytecode doesn't seem that useful: hide it as an internal intermediate representation while doing from_code/to_code.

juliangilbey commented 2 years ago

FYI: I've just been looking at https://github.com/fabioz/PyDev.Debugger, and see that they have a vendored version of bytecode with this somewhat addressed: see https://github.com/fabioz/PyDev.Debugger/commit/bcb8a28669e9178f96f5d71af7259e0674acc47c (though I confess that I do not fully understand the details).