catapult-project / catapult

Deprecated Catapult GitHub. Please instead use http://crbug.com "Speed>Benchmarks" component for bugs and https://chromium.googlesource.com/catapult for downloading and editing source code.
https://chromium.googlesource.com/catapult
BSD 3-Clause "New" or "Revised" License

Migrate tracing/test_data out of Catapult #1933

Open paulirish opened 8 years ago

paulirish commented 8 years ago

On my macbook air I run out of disk space pretty often.

I then start digging into my disk space treemap and eventually find that catapult is half the size of Blink. And half of catapult is the test_data folder.

image (the commas are weird, but that's 208 MB)

I figure we could probably clean that up a bit.

Can test_data be excluded from the deps roll? Alternatively, we could at least turn the JSON files into .json.gz.
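As a rough illustration of the .json.gz idea, here is a minimal Python sketch that compresses the larger JSON traces in a directory and reports the before and after sizes. The function name `gzip_json_files` and the 1 MB threshold are assumptions made for this example, not anything that exists in catapult.

```python
import gzip
import shutil
from pathlib import Path


def gzip_json_files(test_data_dir, min_bytes=1024 * 1024):
    """Compress *.json files of at least min_bytes into *.json.gz siblings.

    Returns a list of (original_path, original_size, compressed_size).
    The originals are left in place; deleting them is a separate decision.
    """
    results = []
    for src in sorted(Path(test_data_dir).glob("*.json")):
        original_size = src.stat().st_size
        if original_size < min_bytes:
            continue  # Small files are not worth the churn.
        dst = src.with_name(src.name + ".gz")
        with open(src, "rb") as f_in, gzip.open(dst, "wb") as f_out:
            shutil.copyfileobj(f_in, f_out)
        results.append((src, original_size, dst.stat().st_size))
    return results
```

Compression only shrinks the working tree. Any uncompressed copies already committed would still be in git history.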

natduca commented 8 years ago

@anniesullie @Apeliotes can one of you help? This is definitely a problem. I think we need to stop checking in test data (and probably purge it from git history so the repo mirror isn't huge).

anniesullie commented 8 years ago

@Apeliotes sounds like the binary dependency manager can help here. I can help you find the relevant tests if you need help.

FYI, here are the files bigger than 1 MB from the output of $ du -m tracing/test_data/* | sort -n -k 1 (sizes in MB):

2   tracing/test_data/battor.zip
2   tracing/test_data/google_highlight.json
2   tracing/test_data/perf_sampling_trace_with_trace_events.json.gz
2   tracing/test_data/tcmalloc_trace.json
3   tracing/test_data/lthi_cats.json.gz
3   tracing/test_data/v8.log
4   tracing/test_data/android_systrace.html
4   tracing/test_data/multiple_input_latency.json
5   tracing/test_data/windows_etw_cswitch.json
6   tracing/test_data/ddms_calculator_start.trace
6   tracing/test_data/perf_sampling_trace.json
8   tracing/test_data/sfgate.json
8   tracing/test_data/wtf.json
10  tracing/test_data/repaints_on_scroll_region.json
11  tracing/test_data/flow_big.json
11  tracing/test_data/tcmalloc_multi_renderer.json
13  tracing/test_data/huge_trace.json
31  tracing/test_data/chrome_v8.json
32  tracing/test_data/memory_dumps.json
52  tracing/test_data/theverge_trace.json

(enough separate large files that we can't really fix the problem by just getting rid of a big file)

natduca commented 8 years ago

I'd like to broaden the goal a bit here: catapult.git right now is O(450 MB). We have something like 35 MB of code + tests, and 15-20 MB of checked-in third_party data because we don't use a dependency system. If I delete the third-party stuff and the test data, we get to a 35 MB repo size without .git.

I think we should set a very simple goal: the .git-less repo size needs to be sizeof(tests) + sizeof(code). And the size of a git clone needs to be on the order of that, so say 50 MB. That requires two things:

  1. We remove all things from the repo that should be brought in as dependencies, etc. Including ALL third parties.
  2. We delete the offending files from git history. Primiano can help us with this when we are ready.
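To track progress against that goal, the ".git-less" size of a checkout can be measured with a few lines of Python. This is only a sketch; the helper name `tree_size_bytes` is made up for illustration.

```python
import os


def tree_size_bytes(root, exclude_dirs=(".git",)):
    """Sum the sizes of regular files under root, skipping excluded dirs."""
    total = 0
    for dirpath, dirnames, filenames in os.walk(root):
        # Prune in place so os.walk does not descend into excluded dirs.
        dirnames[:] = [d for d in dirnames if d not in exclude_dirs]
        for name in filenames:
            path = os.path.join(dirpath, name)
            if not os.path.islink(path):
                total += os.path.getsize(path)
    return total
```

Calling it once with the default and once with `exclude_dirs=()` gives the working-tree size and the full clone size respectively.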

With this goal in mind, should we file a fresh bug so we've got clean discussion? Or should we leave this? Also, should we have a design meeting on how to achieve this? Unfortunately, this is going to be required for the lighthouse project, so this is a bit higher priority than just "code health."

Apeliotes commented 8 years ago

I don't have a preference on whether a new bug gets filed or we continue the discussion here.

I'm happy to help, and am working on a doc to help other people get up to speed on using the dependency manager so we can easily shard the work out.

natduca commented 8 years ago

@Apeliotes can I assign you this and wrangle you into committing to a milestone though? I think we need someone to step up and drive this to completion. We can definitely rally people to help the cause, but the first thing we need is a person to Make It Happen, no matter what. It's pretty urgent for lighthouse that we resolve this too: p1, in fact.

benshayden commented 8 years ago

One crazy simple baby-step solution:

  1. Make new repo catapult-project/test_data.git
  2. Move catapult/.../test-data to there
  3. Make dev server scream if test_data.git isn't in catapult/third_party
  4. Make a bin script to download test_data.git to third_party, or make the dev server do it automatically
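The "scream or download" behavior could look roughly like the Python sketch below. Everything named here is hypothetical: the repo URL is a placeholder (the thread does not give one), and `ensure_test_data` is not an existing catapult function. The `run` parameter exists only so the git call can be swapped out in tests.

```python
import os
import subprocess

# Placeholder URL: the thread proposes the repo but does not give its address.
TEST_DATA_REPO_URL = "https://example.invalid/catapult-project/test_data.git"


def ensure_test_data(third_party_dir, auto_clone=False,
                     run=subprocess.check_call):
    """Return the test_data checkout path, cloning or failing loudly."""
    target = os.path.join(third_party_dir, "test_data")
    if os.path.isdir(target):
        return target
    if not auto_clone:
        # The "scream" case: tell the developer exactly what to do.
        raise RuntimeError(
            "test_data is missing; clone %s into %s"
            % (TEST_DATA_REPO_URL, target))
    run(["git", "clone", TEST_DATA_REPO_URL, target])
    return target
```

A dev server could call this at startup with `auto_clone=False`, while a bin script could call it with `auto_clone=True`.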

Apeliotes commented 8 years ago

@anniesullie, could you set up a catapult-project/testing-data repo for the tracing test data? Ned, Ben and I were talking about this offline today.

anniesullie commented 8 years ago

How do we plan to sync this repo?

benshayden commented 8 years ago

Manually?

These files are used for manual testing only. One main expected usage is that one of us adds a trace file for a specific purpose in order to share it with teammates, so we'd tell them to sync their test_data repo. Another main expected use case is testing new features like the metrics side panel using well-understood traces, so we'd keep these files around and shared.

benshayden commented 8 years ago

Thanks for creating the test-data repo, @anniesullie! I'll copy files from catapult's test-data and @natduca's measurmt-traces to that repo at some point soon, then update the dev server and trace-viewer to use the new repo, then delete the files from catapult.

Updating the dev server and trace-viewer is actually a bit complicated. Here's a design that @natduca was somewhat positive about.

trace-viewer keeps its filename select box, but it changes to Most Recently Used and is not populated by dev server listing the contents of the test-data folder, because that folder will eventually be deleted, because that's the entire point of this issue. A new Open button is added. However, the browser's file picker never returns the full file path, so the "Open" button will ask the dev server to open a tkinter file picker. The file picker can take a hint to start in the directory containing the most recently opened file, which should usually be a local clone of the new test-data repo. The dev server would return the full path of the selected file to trace viewer, which would then update its MRU and read the file through the dev server as it currently does. The MRU can be stored in tr.b.Settings. This design would require users to manually find the test-data repo once, but trace-viewer can walk them through that when the MRU is empty.

@natduca suggested having the dev server copy the selected file to catapult's test-data directory so that the code that lists the contents of the test-data directory can continue to be used. But I only ever use a few test data files, so listing dozens of files that I don't use in the select box is a bug for me, not a feature. If enough people find it useful, though, the trace viewer could ask the dev server to list the contents of the directory containing the most recently used trace file.

natduca commented 8 years ago

Things I don't think sound workable:

What I do think is workable is that the trace viewer works as it does today. But when the devserver gets a list_files request, it goes and looks for the test_data repo. If it's not checked out, it can fail the XHR with an error message that examples/trace_viewer.html can show, prompting the user to clone the test-data directory. Or, frankly, it could just clone the test data directory.
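The list_files behavior described here might be sketched in Python as follows. This is not the actual devserver code: the function name `handle_list_files`, the 404 status, and the JSON error shape are all assumptions chosen for the example.

```python
import json
import os


def handle_list_files(test_data_dir):
    """Return (status_code, json_body) for a list_files request."""
    if not os.path.isdir(test_data_dir):
        # Fail the XHR with a message the page can show to the user.
        body = {
            "error": "test data is not checked out; clone the test-data "
                     "repo to %s" % test_data_dir
        }
        return 404, json.dumps(body)
    names = sorted(
        name for name in os.listdir(test_data_dir)
        if os.path.isfile(os.path.join(test_data_dir, name)))
    return 200, json.dumps(names)
```

Keeping the handler a plain function of the directory path makes it easy to test without starting a server.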

The upload/open button should just be client-side, like how the load button works in profiling_view.html: https://github.com/catapult-project/catapult/blob/master/tracing/tracing/ui/extras/about_tracing/profiling_view.html#L289 . That shows an open file dialog using the browser, and you get back an ArrayBuffer with the file contents in JavaScript. Then just do an XHR to post that to /test_data/new_file, and on the Python side in the tracing_dev_server_config you add a handler for that endpoint that takes that file and writes it to the test_data folder. Then trigger a reload of the page (document.location = document.location), and the file will be there in the list. Done.
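The server-side half of that upload flow could be as small as the sketch below. It assumes the client sends a file name along with the bytes, which the comment above does not specify, and `save_uploaded_trace` is a hypothetical helper rather than an existing handler. The name is reduced to its base name so a request cannot write outside the test_data folder.

```python
import os


def save_uploaded_trace(test_data_dir, filename, data):
    """Write uploaded bytes into test_data_dir and return the new path."""
    # Strip any directory components the client may have sent.
    safe_name = os.path.basename(filename.replace("\\", "/"))
    if not safe_name or safe_name in (".", ".."):
        raise ValueError("invalid file name: %r" % filename)
    os.makedirs(test_data_dir, exist_ok=True)
    dest = os.path.join(test_data_dir, safe_name)
    with open(dest, "wb") as f:
        f.write(data)
    return dest
```

A handler for /test_data/new_file would call this with the request body and then return success so the page can reload.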

eakuefner commented 7 years ago

This is turning out to be an actual issue; for context see https://bugs.chromium.org/p/chromium/issues/detail?id=670284