aist-oceanworks / mudrod

Mining and Utilizing Dataset Relevancy from Oceanographic Datasets to Improve Data Discovery and Access, online demo: https://mudrod.jpl.nasa.gov/#/
https://mudrod.github.io/

Benchmark log ingest #49

Closed Yongyao closed 7 years ago

Yongyao commented 8 years ago

Perform log ingestion tests to benchmark log ingestion speed.

Yongyao commented 8 years ago

For your reference, these are the preprocessing times for the 201409 logs. The import step alone now takes 4000+ s.

2016-08-25 17:14:25,543 INFO esiptestbed.mudrod.discoveryengine.WeblogDiscoveryEngine: ****Web log preprocessing starts***
2016-08-25 17:14:25,543 INFO esiptestbed.mudrod.discoveryengine.WeblogDiscoveryEngine: ****Web log preprocessing starts***** 201409
2016-08-25 17:14:25,543 INFO esiptestbed.mudrod.weblog.pre.ImportLogFile: **Import starts*******
2016-08-25 18:04:57,794 INFO esiptestbed.mudrod.weblog.pre.ImportLogFile: Num of http: 5011713
2016-08-25 18:28:25,339 INFO esiptestbed.mudrod.weblog.pre.ImportLogFile: Num of FTP: 4220940
2016-08-25 18:28:30,775 INFO esiptestbed.mudrod.weblog.pre.ImportLogFile: ****Import ends****Took 4445s
2016-08-25 18:28:30,986 INFO esiptestbed.mudrod.weblog.pre.CrawlerDetection: *****Crawler detection starts*******
2016-08-25 18:31:32,030 INFO esiptestbed.mudrod.weblog.pre.CrawlerDetection: User count: 8956
2016-08-25 18:31:32,030 INFO esiptestbed.mudrod.weblog.pre.CrawlerDetection: ****Crawler detection ends****Took 181s
2016-08-25 18:31:32,500 INFO esiptestbed.mudrod.weblog.pre.SessionGenerator: *****Session generating starts*******
2016-08-25 18:37:04,597 INFO esiptestbed.mudrod.weblog.pre.SessionGenerator: ****Session generating ends****Took 332s
2016-08-25 18:37:04,597 INFO esiptestbed.mudrod.weblog.pre.SessionStatistic: *****Session summarizing starts*******
2016-08-25 18:39:02,960 INFO esiptestbed.mudrod.weblog.pre.SessionStatistic: Session count: 1102
2016-08-25 18:39:03,600 INFO esiptestbed.mudrod.weblog.pre.SessionStatistic: ****Session summarizing ends****Took 119s
2016-08-25 18:39:03,600 INFO esiptestbed.mudrod.discoveryengine.WeblogDiscoveryEngine: *****Web log preprocessing ends****Took 5078s 201409
2016-08-25 18:39:03,630 INFO esiptestbed.mudrod.weblog.pre.HistoryGenerator: *******HistoryGenerator starts*******
2016-08-25 18:39:04,217 INFO esiptestbed.mudrod.weblog.pre.HistoryGenerator: ****HistoryGenerator ends****Took 0s
2016-08-25 18:39:04,217 INFO esiptestbed.mudrod.weblog.pre.ClickStreamGenerator: *****ClickStreamGenerator starts*******
2016-08-25 18:40:33,963 INFO esiptestbed.mudrod.weblog.pre.ClickStreamGenerator: ****ClickStreamGenerator ends****Took 89s
2016-08-25 18:40:33,963 INFO esiptestbed.mudrod.discoveryengine.WeblogDiscoveryEngine: *****Web log preprocessing (user history and clickstream finished) ends*****
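
To make the breakdown easier to read: the four preprocessing steps in this log sum to 5077 s, which is consistent with the reported 5078 s total, and the import step dominates. The following is a minimal Python sketch of how the "Took Ns" lines could be summarized. It is not part of Mudrod; the names `STEP_RE` and `step_durations` are hypothetical, and the embedded excerpt holds only the four "ends" lines from the run above.

```python
import re

# Excerpt of the "ends" lines from the 201409 run quoted above.
LOG = """\
2016-08-25 18:28:30,775 INFO esiptestbed.mudrod.weblog.pre.ImportLogFile: ****Import ends****Took 4445s
2016-08-25 18:31:32,030 INFO esiptestbed.mudrod.weblog.pre.CrawlerDetection: ****Crawler detection ends****Took 181s
2016-08-25 18:37:04,597 INFO esiptestbed.mudrod.weblog.pre.SessionGenerator: ****Session generating ends****Took 332s
2016-08-25 18:39:03,600 INFO esiptestbed.mudrod.weblog.pre.SessionStatistic: ****Session summarizing ends****Took 119s
"""

# Hypothetical pattern: "<ClassName>: ***<step> ends***Took <N>s".
# It tolerates stray '*' and '_' characters around the step name.
STEP_RE = re.compile(r"\.(\w+): [*_]*(.+?) ends[*_]*Took (\d+)s")


def step_durations(log_text):
    """Return {class name: seconds} for every 'ends ... Took Ns' line."""
    durations = {}
    for line in log_text.splitlines():
        match = STEP_RE.search(line)
        if match:
            durations[match.group(1)] = int(match.group(3))
    return durations


durations = step_durations(LOG)
total = sum(durations.values())

# Print each step with its share of the summed preprocessing time.
for name, seconds in durations.items():
    print(f"{name:18s} {seconds:5d} s  {100 * seconds / total:5.1f}%")

# Record counts reported by ImportLogFile in the same run (http + FTP).
records = 5011713 + 4220940
import_rate = records / durations["ImportLogFile"]
print(f"import throughput: {import_rate:.0f} records/s")
```

On these numbers the import step accounts for roughly 88% of the preprocessing time, at about 2077 records per second, so it is the obvious place to profile first.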

lewismc commented 8 years ago

I will attach a profiler today and try to find out where the decrease in writer performance is coming from.

lewismc commented 8 years ago

@Yongyao @quintinali let's have a call shortly to discuss how we can write a test suite for this. It is a critically important aspect of the Mudrod application and I think we need to agree upon how this is going to work.

Yongyao commented 8 years ago

@lewismc No problem. Just let us know your best time.

lewismc commented 7 years ago

@quintinali are you able to provide some statistics regarding log ingest? Some basic graphs and metrics, perhaps from running against the data provided at #85?

quintinali commented 7 years ago

I ran a test on the January 2014 PO.DAAC logs. Before the modification, the processing time of each step was as follows:
import log: 3140 s
crawler detection: 130 s
session identification: 603 s

After the modification, running on a cluster of five virtual machines (16 GB RAM, 8 GPU cores), the processing time decreased significantly:
import log: 152 s
crawler detection: 41 s
session identification: 85 s
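
For a quick sense of scale, the per-step speedup can be worked out from the two sets of timings above. This is only an illustrative Python sketch; the function name `speedups` is hypothetical, and the comparison assumes both runs processed the same January 2014 logs. The hardware used for the "before" run is not stated, so the ratios mix code changes with the move to a five-VM cluster.

```python
# Timings in seconds, copied from the comment above (January 2014 PO.DAAC logs).
before = {"import log": 3140, "crawler detection": 130, "session identification": 603}
after = {"import log": 152, "crawler detection": 41, "session identification": 85}


def speedups(before, after):
    """Return {step: before / after} for every step present in `before`."""
    return {step: before[step] / after[step] for step in before}


per_step = speedups(before, after)
overall = sum(before.values()) / sum(after.values())

for step, factor in per_step.items():
    print(f"{step:24s} {before[step]:5d} s -> {after[step]:4d} s  ({factor:.1f}x)")
print(f"{'all three steps':24s} {sum(before.values()):5d} s -> {sum(after.values()):4d} s  ({overall:.1f}x)")
```

That works out to roughly 20.7x for the import, 3.2x for crawler detection, 7.1x for session identification, and about 13.9x across the three steps (3873 s down to 278 s).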

I will run a test on the data provided at #85.

lewismc commented 7 years ago

Can you please create a wiki page and document all of this? Thank you very much. Once you've created the wiki page and populated it, please close off this issue.

quintinali commented 7 years ago

OK, I will do that.

lewismc commented 7 years ago

@quintinali any chance to have a look at this?

lewismc commented 7 years ago

@quintinali PING

quintinali commented 7 years ago

https://github.com/mudrod/mudrod/wiki/Benchmark-log-ingest

lewismc commented 7 years ago

Hi @quintinali some suggested improvements

Thank you. It is excellent to see the performance improvements we've been able to achieve.

quintinali commented 7 years ago

@lewismc Thanks for your suggestions. I will add these links.