`draft create` detects Ruby on Rails repo as being Markdown

jmeickle commented 6 years ago

--> Draft detected Markdown (96.790314%)
--> Could not find a pack for Markdown. Trying to find the next likely language match...
--> Draft detected Ruby (1.393127%)
DEBU[0000] pack path: /Users/eronarn/.draft/packs/github.com/Azure/draft/packs/ruby 
--> Ready to sail

I can't share the repo, but it is a totally standard repo: Gemfile, Rakefile, lots of .rb files, etc. Most of the individual files were reported as SCSS, Haml, or Ruby, yet the primary detection somehow ended up as Markdown. It still worked, but it's not a great first-time experience, IMO.

bacongobbler commented 6 years ago

While you can't share the source, you can inspect how draft create performs its detection by running it with the --debug flag. For example, trying this with the example-ruby app gives me:

><> draft create --debug
DEBU[0000] with file:  .                                
DEBU[0000] . is 4096 bytes                              
DEBU[0000] with file:  Gemfile                          
DEBU[0000] Gemfile is 44 bytes                          
DEBU[0000] Gemfile got result by name:  Ruby            
DEBU[0000] with file:  Gemfile.lock                     
DEBU[0000] Gemfile.lock is 325 bytes                    
DEBU[0000] Gemfile.lock got result by name:  Ruby       
DEBU[0000] with file:  app.rb                           
DEBU[0000] app.rb is 100 bytes                          
DEBU[0000] app.rb got result by name:  Ruby             
DEBU[0000] language: Ruby percent: 100.000000 color: #701516 
DEBU[0000] linguist.ProcessDir('.') result:

Error: <nil> 
DEBU[0000] Ruby:    100.000000 (#701516)                   
--> Draft detected Ruby (100.000000%)
DEBU[0000] pack path: /home/bacongobbler/.draft/packs/github.com/Azure/draft/packs/ruby 
--> Ready to sail

linguist (a Go port of github/linguist, which Draft uses internally for language detection) works by using a Naive Bayesian Classifier that is trained using "lazy consensus" based on the bytes of code in each programming language. It doesn't consider the file extension for weighting. I know there was a proposal to perform weighted searches based on the file type in https://github.com/github/linguist/issues/2195, but it was closed due to inactivity. That said, nothing stops us from training our own Naive Bayesian Classifier to do weighted searches!
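To make the byte weighting concrete, here is a minimal sketch of how a per-language percentage falls out of file sizes. It is not linguist's actual code: it assumes the reported percentage is simply each language's share of the total bytes, and the function name and file sizes are hypothetical.

```python
from collections import defaultdict


def language_percentages(files):
    """Compute each language's share of total bytes.

    files: iterable of (language, size_in_bytes) pairs, one per file.
    """
    totals = defaultdict(int)
    for language, size in files:
        totals[language] += size
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {}
    return {lang: 100.0 * size / grand_total for lang, size in totals.items()}


# Hypothetical repo: 500 small Ruby files plus one very large file
# that happens to be classified as Markdown.
files = [("Ruby", 2_000)] * 500 + [("Markdown", 60_000_000)]
percentages = language_percentages(files)
for lang, pct in sorted(percentages.items(), key=lambda kv: -kv[1]):
    print(f"{lang}: {pct:.2f}%")
```

Under that assumption, the number of files doesn't matter at all; a single big file can outweigh hundreds of small ones.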

In the meantime, certain directories are ignored by default as "documentation". For example, docs, Documentation, and Examples are all ignored by default, but that doesn't catch every documentation directory, which is probably what happened in your case.

Once you identify which directory (or files) are being detected as Markdown (likely a subdirectory containing documentation), you can add them to the ignore list as the troubleshooting docs recommend. After that, your app will be properly detected as Ruby.

Let me know if that helps!

bacongobbler commented 6 years ago

It would be helpful to know what the directories containing Markdown are called! That way, perhaps we can submit a PR to github/linguist to add those to the default ignore list.

jmeickle commented 6 years ago

Ah, this is the problem then:

ip-192-168-1-126:qdash eronarn$ grep -E -i '(markdown|md)' draft.log
--> Draft detected Markdown (96.396980%)
--> Could not find a pack for Markdown. Trying to find the next likely language match...
time="2018-03-22T09:54:50-04:00" level=debug msg="with file:  README.md"
time="2018-03-22T09:54:50-04:00" level=debug msg="README.md is 11860 bytes"
time="2018-03-22T09:54:50-04:00" level=debug msg="README.md : filename should be ignored, skipping"
time="2018-03-22T09:54:50-04:00" level=debug msg="with file:  ansible/README.md"
time="2018-03-22T09:54:50-04:00" level=debug msg="ansible/README.md is 735 bytes"
time="2018-03-22T09:54:50-04:00" level=debug msg="ansible/README.md : filename should be ignored, skipping"
time="2018-03-22T09:54:50-04:00" level=debug msg="with file:  ansible/roles/collectd/README.md"
time="2018-03-22T09:54:50-04:00" level=debug msg="ansible/roles/collectd/README.md is 2906 bytes"
time="2018-03-22T09:54:50-04:00" level=debug msg="ansible/roles/collectd/README.md : filename should be ignored, skipping"
time="2018-03-22T09:54:50-04:00" level=debug msg="with file:  ansible/roles/consul/KEY_ROTATION.md"
time="2018-03-22T09:54:50-04:00" level=debug msg="ansible/roles/consul/KEY_ROTATION.md is 1073 bytes"
time="2018-03-22T09:54:50-04:00" level=debug msg="reading contents of ansible/roles/consul/KEY_ROTATION.md"
time="2018-03-22T09:54:50-04:00" level=debug msg="ansible/roles/consul/KEY_ROTATION.md got language hints: []string{\"GCC Machine Description\", \"Markdown\"}\n"
time="2018-03-22T09:54:50-04:00" level=debug msg="ansible/roles/consul/KEY_ROTATION.md got result by data:  Markdown"
time="2018-03-22T09:54:50-04:00" level=debug msg="with file:  ansible/roles/consul/README.md"
time="2018-03-22T09:54:50-04:00" level=debug msg="ansible/roles/consul/README.md is 2903 bytes"
time="2018-03-22T09:54:50-04:00" level=debug msg="ansible/roles/consul/README.md : filename should be ignored, skipping"
time="2018-03-22T09:54:50-04:00" level=debug msg="with file:  ansible/roles/envvars/README.md"
time="2018-03-22T09:54:50-04:00" level=debug msg="ansible/roles/envvars/README.md is 446 bytes"
time="2018-03-22T09:54:50-04:00" level=debug msg="ansible/roles/envvars/README.md : filename should be ignored, skipping"
time="2018-03-22T09:54:50-04:00" level=debug msg="with file:  ansible/roles/papertrail/README.md"
time="2018-03-22T09:54:50-04:00" level=debug msg="ansible/roles/papertrail/README.md is 664 bytes"
time="2018-03-22T09:54:50-04:00" level=debug msg="ansible/roles/papertrail/README.md : filename should be ignored, skipping"
time="2018-03-22T09:54:50-04:00" level=debug msg="with file:  ansible/roles/papertrail/files/remote_syslog.systemd.service"
time="2018-03-22T09:54:50-04:00" level=debug msg="ansible/roles/papertrail/files/remote_syslog.systemd.service is 297 bytes"
time="2018-03-22T09:54:50-04:00" level=debug msg="reading contents of ansible/roles/papertrail/files/remote_syslog.systemd.service"
time="2018-03-22T09:54:50-04:00" level=debug msg="ansible/roles/papertrail/files/remote_syslog.systemd.service got language hints: []string(nil)\n"
time="2018-03-22T09:54:50-04:00" level=debug msg="ansible/roles/papertrail/files/remote_syslog.systemd.service got result by data:  Shell"
time="2018-03-22T09:54:50-04:00" level=debug msg="with file:  ansible/roles/rvm/README.md"
time="2018-03-22T09:54:50-04:00" level=debug msg="ansible/roles/rvm/README.md is 5298 bytes"
time="2018-03-22T09:54:50-04:00" level=debug msg="ansible/roles/rvm/README.md : filename should be ignored, skipping"
time="2018-03-22T09:54:50-04:00" level=debug msg="with file:  ansible/roles/ssh-users/README.md"
time="2018-03-22T09:54:50-04:00" level=debug msg="ansible/roles/ssh-users/README.md is 880 bytes"
time="2018-03-22T09:54:50-04:00" level=debug msg="ansible/roles/ssh-users/README.md : filename should be ignored, skipping"
time="2018-03-22T09:54:50-04:00" level=debug msg="with file:  ansible/roles/vault-module/README.md"
time="2018-03-22T09:54:50-04:00" level=debug msg="ansible/roles/vault-module/README.md is 1611 bytes"
time="2018-03-22T09:54:50-04:00" level=debug msg="ansible/roles/vault-module/README.md : filename should be ignored, skipping"
time="2018-03-22T09:54:51-04:00" level=debug msg="strace.out got result by data:  Markdown"
time="2018-03-22T09:54:51-04:00" level=debug msg="language: Markdown percent: 96.396980 color: "
time="2018-03-22T09:54:51-04:00" level=debug msg="Markdown:\t96.396980 ()"

Turns out that I had an strace log file in the directory. It got read as Markdown, and because it's so large (60 megabytes), it dominated the results.
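For anyone else hitting this, a quick way to spot a file like that is to list the largest files by their share of the total bytes. This is a rough sketch, not part of Draft; `largest_files` is a hypothetical helper, and it skips `.git` on the assumption that VCS metadata isn't interesting here.

```python
import os


def largest_files(root, top=5):
    """Return (share_of_total_bytes, size, path) for the biggest files under root."""
    sizes = []
    for dirpath, dirnames, filenames in os.walk(root):
        # Prune .git in place so os.walk does not descend into it.
        dirnames[:] = [d for d in dirnames if d != ".git"]
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.isfile(path) and not os.path.islink(path):
                sizes.append((os.path.getsize(path), path))
    total = sum(size for size, _ in sizes)
    sizes.sort(reverse=True)
    return [(size / total if total else 0.0, size, path) for size, path in sizes[:top]]


# Example usage from the repo root:
#   for share, size, path in largest_files("."):
#       print(f"{share:6.1%}  {size:>12}  {path}")
```

Any single file near the top with a share close to 100% is a good candidate for the ignore list.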

So perhaps one of these:

1) Develop a classifier for strace output (should be trivial) and ignore anything that matches it.
2) Ignore single files that are disproportionately large, because they are likely to be miscategorized output logs, binary artifacts, etc.
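Roughly what I have in mind for both options, as a sketch only. Nothing here is an existing Draft or linguist API: the function names, the regex, and the thresholds (80% of sampled lines, 50% of total bytes) are all made up for illustration. The regex assumes the usual strace line shape, `syscall(args) = retval`, optionally prefixed by a PID.

```python
import re

# Matches lines such as: open("/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 3
STRACE_LINE = re.compile(
    r"^(?:\[pid\s+\d+\]\s+|\d+\s+)?\w+\(.*\)\s*=\s*(?:-?\d+|0x[0-9a-fA-F]+|\?)"
)


def looks_like_strace(text, threshold=0.8, sample_lines=200):
    """Option 1: guess whether text is strace output.

    Checks what fraction of the first non-blank lines look like syscalls.
    For huge files, pass only the first few kilobytes of the file.
    """
    lines = [ln for ln in text.splitlines()[:sample_lines] if ln.strip()]
    if not lines:
        return False
    hits = sum(1 for ln in lines if STRACE_LINE.match(ln))
    return hits / len(lines) >= threshold


def drop_outliers(sizes, max_share=0.5):
    """Option 2: drop any single file that exceeds max_share of the total bytes.

    sizes: dict mapping path -> size in bytes.
    """
    total = sum(sizes.values())
    if total == 0:
        return dict(sizes)
    return {path: size for path, size in sizes.items() if size / total <= max_share}
```

Option 2 is cruder but more general, since it would also catch other logs and binary artifacts without needing a classifier per format. The downside is that a repo with one legitimately large source file would lose it, so the threshold would need some care.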

bacongobbler commented 6 years ago

Sounds good! Once we implement #593, this should be a relatively simple fix.

Azure / draft-classic

`draft create` detects Ruby on Rails repo as being Markdown #591