florianschanda / miss_hit

MATLAB Independent, Small & Safe, High Integrity Tools - code formatter and more
GNU General Public License v3.0

Semantic type analyser brainstorming #247

Closed PeterTillema closed 2 years ago

PeterTillema commented 2 years ago

Heya! So, I've been working on a MATLAB to Python/NumPy converter for the past few months, using your excellent tools for this. As you have said multiple times in issues/code, you need to get a semantic type analyser working. So, my idea was to share some ideas about analysis and code generation.

For example, what would happen if you have the following (nonsense) code:

fileDir = [pwd filesep 'dir' filesep]
if ~exist(fileDir)
    disp(fileDir(2))
end

If you translate that to Python, you would expect something like

fileDir = np.array([[os.getcwd(), os.sep, 'dir', os.sep]])  # note: 2D-array!
if not os.path.exists(fileDir):
    print(fileDir[1])

which of course doesn't work right, as os.path.exists can't take a NumPy array. You can change the initialization, but the array might be used as a proper array, and then the code breaks. So what are your thoughts about this?
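One possible way out, sketched below, assumes the analyser can prove that every operand of the concatenation is a char row vector. In that case the brackets are plain string concatenation and the generator could emit a Python str instead of an array. The name file_dir is only an illustrative rename of fileDir, and os.path.exists is kept as the stand-in for exist used above, even though MATLAB's exist also checks variables and functions.

```python
import os

# MATLAB: fileDir = [pwd filesep 'dir' filesep]
# All four operands are char row vectors, so horizontal concatenation
# is ordinary string concatenation (assumption: types are known).
file_dir = os.getcwd() + os.sep + 'dir' + os.sep

# MATLAB: if ~exist(fileDir), disp(fileDir(2)), end
if not os.path.exists(file_dir):
    print(file_dir[1])  # fileDir(2): 1-based index 2 becomes 0-based index 1
```

If the same variable is later used as a numeric array, this shortcut no longer applies, which is exactly why the type information is needed first.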

Also, indexing. Yay. A single index in MATLAB is perfectly fine, but logically won't get the same results as with NumPy. What kind of code should be generated do you think?
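To make the difference concrete: MATLAB's single index is 1-based and walks the array column by column, while NumPy's default flat order is row by row. A minimal sketch of one option, assuming the generator is allowed to emit a call to a small runtime helper (matlab_linear_index is a made-up name, not part of NumPy):

```python
import numpy as np

def matlab_linear_index(a, k):
    """Return the element MATLAB would give for a(k) on a NumPy array a."""
    # MATLAB is 1-based and column-major, so unravel the 0-based offset
    # in Fortran order to get the equivalent NumPy subscripts.
    subscripts = np.unravel_index(k - 1, a.shape, order='F')
    return a[subscripts]

a = np.array([[1, 2],
              [3, 4]])
print(matlab_linear_index(a, 2))  # MATLAB a(2) walks down the first column
```

For this 2x2 array the helper picks 3, whereas a.flat[1] would pick 2. The helper only covers scalar indices; ranges, end and logical masks would each need their own handling.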


Regarding my semantic type analyser, I have one main function, which looks like this:

    def visit(self, tree, relation=None):
        name = 'visit_' + tree.__class__.__name__
        if hasattr(self, name):
            getattr(self, name)(tree, relation)
        else:
            warnings.warn(
                'Couldn\'t check node ' + tree.__class__.__name__ + ' during the SemanticTypeAnalyser pass')

which dispatches each node to the matching visit_<ClassName> method in the same class. So far, everything is fine. But as you said, it is a hard task. For example, how do you deal with types and sizes? How do you track them and store them in the tree? At what point do you visit functions? How do you handle array indexing, or function calls?
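One way to approach the "track and store" question is to have each visit method return a small type record and to keep a symbol table on the analyser. The sketch below is only an illustration: TypeInfo, TypeAnalyser and the node classes are hypothetical and deliberately simplified, and are not MISS_HIT's real AST classes.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TypeInfo:
    base: str      # e.g. 'char', 'double', 'unknown'
    shape: tuple   # e.g. (1, 3); None marks an unknown dimension

# Hypothetical, simplified AST nodes (not MISS_HIT's real classes).
@dataclass
class CharLiteral:
    value: str

@dataclass
class NumberLiteral:
    value: float

@dataclass
class Identifier:
    name: str

@dataclass
class Assignment:
    target: str
    value: object

class TypeAnalyser:
    def __init__(self):
        self.symbols = {}  # variable name -> TypeInfo

    def visit(self, node):
        # Same name-based dispatch as above, but every visitor returns a TypeInfo.
        method = getattr(self, 'visit_' + node.__class__.__name__)
        return method(node)

    def visit_CharLiteral(self, node):
        return TypeInfo('char', (1, len(node.value)))

    def visit_NumberLiteral(self, node):
        return TypeInfo('double', (1, 1))

    def visit_Identifier(self, node):
        # Unknown names get an explicit 'unknown' type rather than a guess.
        return self.symbols.get(node.name, TypeInfo('unknown', (None, None)))

    def visit_Assignment(self, node):
        info = self.visit(node.value)
        self.symbols[node.target] = info
        return info

analyser = TypeAnalyser()
analyser.visit(Assignment('s', CharLiteral('dir')))
analyser.visit(Assignment('t', Identifier('s')))
print(analyser.symbols['t'])
```

Keeping the types in a side table rather than on the tree nodes is a design choice made here for brevity; annotating the nodes directly would work just as well.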


I can share my entire type analyser/code generator if you would like :) So far, it produces syntactically correct output (as in, whitespace and comments are right), but when you run the generated program you get a ton of errors. For example, I can't properly detect whether you are calling a function or indexing an array, so it always emits () even where [] is needed, which obviously won't work.
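A rough heuristic for the () versus [] problem, assuming the analyser already knows which names have been assigned in the current scope: in MATLAB a variable takes precedence over a function of the same name, so name(args) on an assigned name can be treated as indexing. The helper below (emit_reference is a made-up name) is a sketch of that rule only. It ignores colon ranges, end, logical indexing, and variables created dynamically via eval or load.

```python
def emit_reference(name, args, assigned_variables):
    """Emit Python source for the MATLAB expression name(args).

    args holds already-translated argument strings; assigned_variables is
    the set of names assigned earlier in the current scope.
    """
    if name in assigned_variables:
        # A variable shadows any function of the same name, so this is
        # indexing: shift each 1-based subscript to 0-based.
        return name + '[' + ', '.join(arg + ' - 1' for arg in args) + ']'
    # Otherwise assume a function call and leave the arguments untouched.
    return name + '(' + ', '.join(args) + ')'

scope = {'fileDir'}
print(emit_reference('fileDir', ['2'], scope))
print(emit_reference('disp', ['fileDir'], scope))
```

With the scope above, the first call yields fileDir[2 - 1] and the second yields disp(fileDir).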

So, TL;DR: I'm curious what your thoughts are about this :) (and then especially the semantic type analyser). Thanks!

RobBW commented 2 years ago

Semantic Type Analysis - brainstorming (Issue #247)

In my attempt to understand the complexities of transpiling Matlab to Python I have found 3 useful sources of insight. Perhaps they have ideas you can use?

  1. McGill University’s Sable project in which typing of Matlab is a major factor. There is a good descriptive paper here:

Li, X., & Hendren, L. (2014). Mc2FOR demo: A tool for automatically translating MATLAB to FORTRAN 95. 2014 Software Evolution Week - IEEE Conference on Software Maintenance, Reengineering, and Reverse Engineering (CSMR-WCRE). doi:10.1109/csmr-wcre.2014.6747218

Also a thesis in which many of the best attempts so far are reviewed. It also details their methods to produce consistent results.

An interesting looking link: https://github.com/Sable/mclab-ide

Unfortunately, this project has lost its leader: Prof Laurie Hendren died about 2 years ago.

  2. @Matteus Bysiek’s Transpyle project https://github.com/mbdevpl/transpyle at Riken is a demo of using a type-extended Python AST to update legacy scientific Fortran code for HPC modelling (of supernovas!)

Background papers here: Towards Performance Portability and Modernization of FLASH via Transpilation with High-Level Intermediate Representation & IPSJ-HPC16157009.pdf & IPSJ-HPC18165038.pdf

Bysiek wrote typed-astunparse, an unparser for Python 3 ASTs with type comments.

Converting Matlab to Bysiek’s extended Python AST could open up many possibilities for creators of Matlab code.

  3. A recent collection of papers on using machine learning on ASTs to infer semantics from the deep tree structures. Here:

I hope these suggestions are helpful. I have copies of the quoted papers if you cannot find them.

florianschanda commented 2 years ago

I will need to take some time to properly answer this. But on arrays, I believe in MATLAB all array indices are integral (unlike e.g. Ada), so you can just translate a(n) to a[n-1] and it should always work. 2D arrays you need to deal with in some way, but they are harder since you need to know the size. Or you translate them as lists-of-lists, then you don't.

Let me get back to you on the rest. Since there is now official interest, I will put the semantic analysis at the top of my prio list.

florianschanda commented 2 years ago

Obviously what I said doesn't work if we don't even know if something is an array or a function call. I really think I need to get sem working...

florianschanda commented 2 years ago

Also, moving this to a discussion item, since it's not really a feature/bug