Azure / cadl-ranch

Cadl Scenarios for client generations
https://azure.github.io/cadl-ranch/
MIT License
8 stars 30 forks source link

Incorrect result of saved coverage report #121

Closed pshao25 closed 2 years ago

pshao25 commented 2 years ago

Symptom: @haolingdong-msft found the printed coverage report is broken as image See coverage file here.

Root cause: This a random issue so there is a guess of how this happen. We save coverage data for every request by await writeFile. Take a look at the comments from writeFile:

It is unsafe to use fsPromises.writeFile() multiple times on the same file without waiting for the promise to be settled.

A potential issue happens like:

  1. request1 and request2 are accepted by cadl-ranch.
  2. request1 finishes and "await writeFile with coverage data1", then it is suspended.
  3. request2 finishes and "await writeFile with coverage data2", then it is suspended.
  4. "await writeFile with coverage data2" finishes first, we save coverage data 2.
  5. "await writeFile with coverage data1" finishes later and we save coverage data 1, which is not correct.

This is the issue we might face. For the specific broken issue @haolingdong-msft met, maybe somehow "coverage data2" is longer than "coverage data1", and the "coverage data1" just replace the data in the buffer (which already has "coverage data2"). Then the json brokes.

Anyway, we might need to reconsider the implementation of saving the coverage report. Make it sync (which costs) or save file only once at the end.

haolingdong-msft commented 2 years ago

This is our CI that meets this problem: https://dev.azure.com/azure-sdk/internal/_build/results?buildId=1877589&view=logs&j=ca395085-040a-526b-2ce8-bdc85f692774&t=3763df9f-aac5-5292-209f-31e3fd069cd1 This is the published coverage report: https://dev.azure.com/azure-sdk/internal/_build/results?buildId=1877589&view=artifacts&pathAsName=false&type=publishedArtifacts

You will see the coverage json file was broken and check-coverage fails at parseJSON

timotheeguerin commented 2 years ago

hhm, could also just wait until the previous write is done before starting a new one.

The problem with writing at the end is: when is the end? The server being the one writing the coverage it's hard to know. Could do it when asked to shutdown but that would not work if the process get forced killed(Would be something to note that you have to make sure to not do when using the test server). You also would have to make sure to stop the server before continuing

timotheeguerin commented 2 years ago

Ok ended setting a few hooks to make sure it is written on exit. If it doesn't work let me know and how you are killing the processs so we can find a solution.

Maybe can also provide a /.admin/save-coverage endpoint to manually force write.

haolingdong-msft commented 2 years ago

Thanks @timotheeguerin for the quick fix! Personally, I think providing a /.admin/save-coverage endpoint to manually force write is more straightforward. Because as a customer, I need to first realize I have to kill the process, before I can save the coverage file. Secondly, If I want to kill the process in CI, I need to first find the process ID running cadl-ranch serve, and then kill the process, it's a bit complex. (Currently in CI, we don't manually kill the process, just let CI finish and kill the running processes automatically)

PS: we can also provide a stop server command like testserver to shutdown the server and save the file, like: cadl-ranch stop-server --coverageFile xxx. This is also very straight forward.

timotheeguerin commented 2 years ago

I guess I talked to Izzy today about it and python had a different workflow where the fix I made worked well.

What do you think about having a cadl-ranch server start and cadl-ranch server stop command that automatically start the process in the background.

Can always also do the admin endpoint with yet another command cadl-ranch server save-coverage

haolingdong-msft commented 2 years ago

Thanks Tim for the quick response. Having cadl-ranch server stop works. Does it mean we still provide --coverageFile when start the server, and use the stop command to stop the server and the file will be saved?

pshao25 commented 2 years ago

@timotheeguerin are you saying we could only start one instance of cadl-ranch in one machine? Otherwise, which one is cadl-ranch server stop going to stop?

timotheeguerin commented 2 years ago

Are you running multiple instance of the test server on the same machine at once? How are you using it?

pshao25 commented 2 years ago

Correct. We would run test in parellel and we start cadl-ranch by such command image You could see two instances of cadl ranch are started at port 64683 and 64686 separately.

timotheeguerin commented 2 years ago

Why are you not starting a single instance if you are running test on parallel with on the same machine?

And then if you are doing this wouldn’t this be the original cause of the corrupted file? 2 process trying to write together?

pshao25 commented 2 years ago

Why are you not starting a single instance if you are running test on parallel with on the same machine?

We could. It just costs and has low priority. So I don't want to do this currently. But that is a good idea and we could do that in the future.

And then if you are doing this wouldn’t this be the original cause of the corrupted file? 2 process trying to write together?

No, we .net haven't face this issue currently. It is Jave facing this and Java only start one process.

Actually my point is: we could start two instances of cadl ranch, so theoretically cadl-ranch server stop doesn't know which one to stop?

timotheeguerin commented 2 years ago

I am not proposing replacing the current serve. I am suggesting to add start and stop command as an alternative. The way I would see this is by allowing the port to be passed in the stop command which I guess you must know.

haolingdong-msft commented 2 years ago

Hi @timotheeguerin, may I know when could start and stop command be available, because currently our CI pipeline is pending on this stop command to stop the server and save the coverage file on exit (currently our pipeline can't save and publish coverage file)?

timotheeguerin commented 2 years ago

I can get that done today.

timotheeguerin commented 2 years ago

https://github.com/Azure/cadl-ranch/pull/126

haolingdong-msft commented 2 years ago

Hi @timotheeguerin, thanks for the quick fix. I saw your pull request, but I found the command cadl-ranch server stop command is not recognized in cadl-ranch cli.

$ npx cadl-ranch server stop
npm WARN config global `--global`, `--local` are deprecated. Use `--location=global` instead.
cadl-ranch <command>

Commands:
  cadl-ranch validate-scenarios <scenarios  Compile and validate all the Cadl sc
  Path>                                     enarios.
  cadl-ranch generate-scenarios-summary <s  Compile and validate all the Cadl sc
  cenariosPath>                             enarios.
  cadl-ranch serve <scenariosPath>          Serve the mock api at the given path
                                            s.
  cadl-ranch check-coverage <scenariosPath  Serve the mock api at the given path
  >                                         s.
  cadl-ranch validate-mock-apis <scenarios  Validate mock apis have all the scen
  Path>                                     arios specified
  cadl-ranch upload-manifest <scenariosPat  Upload the scenario manifest. DO NOT
  h>                                         CALL in generator.
  cadl-ranch upload-coverage                Upload the coverage report.
Options:
  --version  Show version number                                       [boolean]
  --help     Show help                                                 [boolean]
  --debug    Output debug log messages.               [boolean] [default: false]
Unknown arguments: server, stop

I checked the release of cadl ranch: https://github.com/Azure/cadl-ranch/actions/workflows/release.yml, finding the release for ‘Start and stop server commands’ failed. Not sure if this is reason of the command failure. Please kindly take a look. Thanks.

timotheeguerin commented 2 years ago

yeah seems like the publish pipeline didn't create the PR bumping the versions, should be publihsed now.