aboutcode-org / typecode

Filetype detection differences on Windows #12

Open JonoYang opened 4 years ago

JonoYang commented 4 years ago

Some of the tests have different results on Windows:

FAILED tests/typecode/test_contenttype.py::TestContentTypeComplex::test_size
FAILED tests/typecode/test_types.py::TestFileTypesDataDriven::test_filetest_archive_e_tar_gz_4
FAILED tests/typecode/test_types.py::TestFileTypesDataDriven::test_filetest_archive_file_4_26_1_diff_gz_5
FAILED tests/typecode/test_types.py::TestFileTypesDataDriven::test_filetest_code_c_netdb_h_44
FAILED tests/typecode/test_types.py::TestFileTypesDataDriven::test_filetest_code_java_appender_java_53
FAILED tests/typecode/test_types.py::TestFileTypesDataDriven::test_filetest_code_java_commonviewersitefactory_jad_55
FAILED tests/typecode/test_types.py::TestFileTypesDataDriven::test_filetest_code_java_logger_java_56
FAILED tests/typecode/test_types.py::TestFileTypesDataDriven::test_filetest_code_java_contenttype_java_57
FAILED tests/typecode/test_types.py::TestFileTypesDataDriven::test_filetest_code_python___init___py_59
FAILED tests/typecode/test_types.py::TestFileTypesDataDriven::test_filetest_config_defconfig_ar531x_jffs2_71
FAILED tests/typecode/test_types.py::TestFileTypesDataDriven::test_filetest_doc_office_glitch_erd_vsd_84
FAILED tests/typecode/test_types.py::TestFileTypesDataDriven::test_filetest_doc_office_word_doc_91

In the case of test_filetest_code_java_logger_java_56, a different filetype, mimetype, and file size were detected:

Expected result for test_filetest_code_java_logger_java_56:

filetype_file: Java source, ASCII text
mimetype_file: text/x-java
mimetype_python: text/x-java-source
filetype_pygment: Java
programming_language: Java
is_file: yes
is_regular: yes
size: 6800
is_text: yes
contains_text: yes
is_source: yes

Actual result on Windows for test_filetest_code_java_logger_java_56:

filetype_file: ASCII text, with CRLF line terminators
mimetype_file: text/plain
mimetype_python: text/x-java-source
filetype_pygment: Java
programming_language: Java
is_file: yes
is_regular: yes
size: !!int '7013'
is_text: yes
contains_text: yes
is_java_source: yes
is_source: yes

The detected types and size should be the same. The other failing tests have similar issues.

JonoYang commented 4 years ago

We use os.path.getsize to get the size of a file. I ran os.path.getsize('configure.bat') on Windows and Ubuntu to see if we get a difference in size. On Windows, configure.bat is 4185 bytes and on Ubuntu it is 4064 bytes. The size discrepancy is due to the line-ending differences between Windows and Linux: lines in text files on Windows end in \r\n, rather than just \n as on Linux. So for every line in a text file on Windows, there is an extra byte.

On a freshly cloned repo, typecode/tests/typecode/data/contenttype/size/dir/a.txt is 2 bytes in Ubuntu and 3 bytes in Windows.
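To make the one-extra-byte-per-line effect concrete, here is a minimal sketch in Python 3. It is not typecode code: the helper name size_with_line_ending and the sample batch lines are made up for illustration. It writes the same lines once with LF and once with CRLF endings and compares what os.path.getsize reports.

```python
import os
import tempfile


def size_with_line_ending(lines, newline):
    """Write `lines` terminated by `newline` (bytes) to a temporary file
    and return the on-disk size reported by os.path.getsize.

    Hypothetical helper for illustration only.
    """
    data = b"".join(line + newline for line in lines)
    fd, path = tempfile.mkstemp()
    try:
        # Binary mode, so Python performs no newline translation of its own.
        with os.fdopen(fd, "wb") as f:
            f.write(data)
        return os.path.getsize(path)
    finally:
        os.remove(path)


# Made-up sample content standing in for a small batch script.
lines = [b"@echo off", b"echo hello", b"exit /b 0"]

lf_size = size_with_line_ending(lines, b"\n")
crlf_size = size_with_line_ending(lines, b"\r\n")

# The CRLF copy is larger by exactly one byte per line.
print(lf_size, crlf_size, crlf_size - lf_size)
```

By the same arithmetic, the 4185 - 4064 = 121 byte gap reported for configure.bat would correspond to 121 CRLF-terminated lines, assuming the content is otherwise identical.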

pombredanne commented 4 years ago

@JonoYang this difference would be quite a surprise!

pombredanne commented 4 years ago

I did a quick check and the size is the same for me, given the exact same file (sha1-wise) as input. IMHO you must have been tripped up by using different checkouts or branches on each OS :)

JonoYang commented 4 years ago

@pombredanne so configure.bat is 4064 bytes on Windows and Ubuntu for you?

On a freshly cloned typecode repo:

C:\Users\Jono\Desktop\typecode-new-new>python
Python 2.7.17 (v2.7.17:c2f86d86e6, Oct 19 2019, 21:01:17) [MSC v.1500 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import os
>>> os.path.getsize('configure.bat')
4185L
>>> len(open('configure.bat').read())
4064
>>>

I think Git is adding CRLF line endings automatically to text files when the repo is cloned on Windows. The same thing is happening on the Azure CI with respect to TestContentTypeComplex.test_size.
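The two numbers in the session above differ most likely because os.path.getsize counts raw bytes on disk, while open() in text mode translates \r\n to \n before len() sees the data. The sketch below reproduces that in Python 3 (the session above is Python 2.7, so this is an analogue, not the same code path). The function name line_ending_report and the sample bytes are hypothetical.

```python
import os
import tempfile


def line_ending_report(path):
    """Compare the raw on-disk size of `path` with its text-mode length.

    Hypothetical helper for illustration only.
    """
    with open(path, "rb") as f:
        raw = f.read()  # bytes exactly as stored on disk
    # newline=None enables universal newlines: "\r\n" is read as "\n".
    # latin-1 maps every byte to one character, so lengths stay comparable.
    with open(path, "r", encoding="latin-1", newline=None) as f:
        text = f.read()
    return {
        "getsize": os.path.getsize(path),
        "binary_len": len(raw),
        "text_len": len(text),
        "crlf_count": raw.count(b"\r\n"),
    }


# Made-up sample: two lines stored with CRLF endings.
fd, sample_path = tempfile.mkstemp()
try:
    with os.fdopen(fd, "wb") as f:
        f.write(b"a\r\nb\r\n")
    report = line_ending_report(sample_path)
finally:
    os.remove(sample_path)

print(report)
```

In this sketch getsize minus text_len equals crlf_count, which is the same relationship as 4185 versus 4064 for configure.bat. Reading in binary mode is the way to see what is really in the working tree.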

JonoYang commented 4 years ago

@pombredanne I got around the size detection difference by modifying the Windows Azure pipeline job template to run git config --global core.autocrlf false before checking out the repo, so that Git doesn't replace the line endings when checking out files. I also had to add a .gitattributes file that sets eol=crlf for configure.bat, because I found that Windows does not run the batch script properly when it has LF line endings rather than CRLF.

There are still filetype/mimetype detection differences remaining.