NCEAS / open-science-codefest

Web site and planning materials for open science conference.
http://nceas.github.io/open-science-codefest

Run Distributed Release Audit Tool (DRAT) on all codefest generated code and report out on license statistics #27

Open chrismattmann opened 10 years ago

chrismattmann commented 10 years ago

Organizational Page: DRAT

DRAT (https://github.com/chrismattmann/drat/) is a release audit tool that takes Apache RAT and turns it into a Map Reduce style system for large and heterogeneous code bases where RAT falls flat on its face. RAT is unable to easily differentiate between different file MIME types and tries to do license analysis on e.g., binary files unless told otherwise through complex white lists and black lists. DRAT, on the other hand, improves upon RAT by using Apache Tika to partition the code base by MIME type, constructing a Solr4 catalog of the code, and then farming out a large Map Reduce style job. Each Mapper takes an N-sized (configurable, set initially to 100) set of files of the same MIME type, partitioned across machines using Apache OODT, and the Reducer is the RAT log aggregator that combines each mapped RAT job's intermediate RAT log output.
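To make the partition/map/reduce flow above concrete, here is a rough Python sketch. It is not DRAT's code. It assumes the standard-library mimetypes module as a stand-in for Apache Tika, a hypothetical has_license callback as a stand-in for a RAT run, and plain in-process loops instead of OODT distribution. All function names are made up for illustration.

```python
import mimetypes
from collections import Counter, defaultdict
from itertools import islice

CHUNK_SIZE = 100  # the description above says N is configurable, initially 100


def partition_by_mime(paths):
    """Group paths by guessed MIME type (stdlib guess by extension, not Tika)."""
    groups = defaultdict(list)
    for path in paths:
        mime, _ = mimetypes.guess_type(path)
        groups[mime or "application/octet-stream"].append(path)
    return groups


def chunks(items, size=CHUNK_SIZE):
    """Yield successive lists of at most `size` items."""
    it = iter(items)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch


def map_audit(batch, has_license):
    """Hypothetical mapper: stands in for one RAT run over a same-type batch."""
    return Counter("licensed" if has_license(p) else "unlicensed" for p in batch)


def reduce_logs(partials):
    """Reducer: merge the per-batch counts into a single summary."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total


def audit(paths, has_license, size=CHUNK_SIZE):
    """Partition by MIME type, audit each batch, then aggregate the results."""
    partials = []
    for files in partition_by_mime(paths).values():
        for batch in chunks(files, size):
            partials.append(map_audit(batch, has_license))
    return reduce_logs(partials)
```

In this sketch, calling audit(list_of_paths, my_check) returns a Counter with "licensed" and "unlicensed" totals. In the real tool the mapper output is a RAT log rather than a count, and the batches are farmed out across machines.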

DRAT has been run on the DARPA XDATA code base (~50K files, 10s of M of lines of code) and the Computational Infrastructure for Geodynamics (CIG) code base (~500K files, 100s of M of lines of code). It scales well and is easy to use, and the software can be run on a single machine with an existing OS or run as a virtual machine using Vagrant and vagrant up.

This task will involve deploying DRAT, and then running it across the code bases to perform a license analysis and to report out on the results at the end of the codefest. Patches and improvements to DRAT are welcomed as well.

chrismattmann commented 10 years ago

this is a proposed session.

chrismattmann commented 10 years ago

@lewismc FYI

chrismattmann commented 10 years ago

@tpaluslich FYI

lewismc commented 10 years ago

Excellent idea @chrismattmann. This will be interesting.

chrismattmann commented 10 years ago

thanks @mbjones for adding the labels, and thanks @lewismc

chrismattmann commented 10 years ago

Hi all, does anyone have specific pointers to repositories to run DRAT on post workshop? @mbjones ideas?

chrismattmann commented 10 years ago

@mbjones ping ^^ any specific pointers to code that was generated at the Code Fest? Just looking for URLs to run DRAT on.

chrismattmann commented 10 years ago

Ray Idaszak pointed me here: https://github.com/JeffHeard/django_docker_processes

Running DRAT on it now.

chrismattmann commented 10 years ago

OK, DRAT is done running: [two images attached]

chrismattmann commented 10 years ago

Looks like there are ~12 Python files, all unlicensed. Here's the detailed RAT log: https://paste.apache.org/hJgw