biocore / qiime

Official QIIME 1 software repository. QIIME 2 (https://qiime2.org) has succeeded QIIME 1 as of January 2018.
GNU General Public License v2.0

Merge qiime_test_data repo into QIIME #590

Closed: jairideout closed this issue 11 years ago

jairideout commented 11 years ago

The qiime_test_data repository is currently separate from QIIME. I think it'd be a good idea to merge it into the QIIME repository, as this will help sync up the two in terms of versioning, as well as help developers remember to update the test data as they make changes to script input/output. Plus users would have immediate access to the test data.

The downside is that the QIIME repo will grow in size: qiime_test_data is currently around 278 MB. However, this size would decrease for release .tar.gz's as we'd be compressing it.

ElDeveloper commented 11 years ago

I'm not a fan of this idea. Making downloads necessarily greater than 250 MB seems like a bad idea to me.

I think making a tutorial about using the qiime_test_data repository would probably be better, from the developer and from the user perspective. But of course, I could be wrong.

jairideout commented 11 years ago

I completely agree that inflating the size of the repo isn't great, but there are really only a handful of scripts that make up a majority of the size. Perhaps we could decrease the size of the test data in conjunction with issue #582?

gregcaporaso commented 11 years ago

Yeah, I bet we could reduce the size. One quick way to do that would be to have alpha_rarefaction.py automatically gzip results that are not plots (e.g., all of the rarefied OTU tables).
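
A rough sketch of that idea, done as a post-processing step on an output directory rather than inside alpha_rarefaction.py itself. The function name `gzip_non_plots` and the list of extensions treated as plots are assumptions for illustration, not QIIME behavior:

```shell
#!/usr/bin/env bash
# Hypothetical helper: gzip every regular file under a results directory,
# skipping files that look like plots or web pages and files already gzipped.
# The extension list below is an assumption; adjust it to the real outputs.
gzip_non_plots() {
    local results_dir="$1"
    find "$results_dir" -type f \
        ! -name '*.png' ! -name '*.pdf' ! -name '*.html' \
        ! -name '*.gz' \
        -exec gzip {} +
}
```

It would be called as `gzip_non_plots path/to/output_dir`. Anything that later reads those files would need to handle the `.gz` versions.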

gregcaporaso commented 11 years ago

I'm thinking that we need to do this, but first need to reduce the total output size. It's currently very clunky to have to add changes to both repos and time pull requests with one another.

Here are the individual directory sizes in kbytes:

15:48:33 qiime_test_data@master$ du -k -d 1 . | sort -nr
56796   ./core_qiime_analyses
40148   ./align_seqs
28324   ./make_rarefaction_plots
20752   ./jackknifed_beta_diversity
16884   ./inflate_denoiser_output
11212   ./alpha_rarefaction
8284    ./denoiser_preprocess
6204    ./beta_diversity_through_plots
5192    ./compare_3d_plots
4344    ./pick_subsampled_reference_otus_through_otu_table
4012    ./pick_otus_through_otu_table
3792    ./make_3d_plots
3628    ./plot_taxa_summary
3356    ./filter_alignment
2848    ./make_fastq
2408    ./pick_otus
2268    ./make_2d_plots
1676    ./parallel_multiple_rarefactions
1632    ./summarize_taxa_through_plots
1256    ./make_distance_histograms
860 ./make_otu_heatmap_html
860 ./filter_otus_by_sample
820 ./adjust_seq_orientation
672 ./pick_rep_set
576 ./pick_reference_otus_through_otu_table
500 ./multiple_rarefactions
488 ./parallel_pick_otus_uclust_ref
468 ./make_otu_network
436 ./parallel_pick_otus_trie
432 ./parallel_pick_otus_blast
396 ./parallel_beta_diversity
344 ./filter_samples_from_otu_table
316 ./shared_phylotypes
308 ./alpha_diversity
304 ./beta_diversity
292 ./summarize_taxa
292 ./parallel_alpha_diversity
292 ./otu_category_significance
292 ./compute_core_microbiome
272 ./split_libraries_fastq
268 ./multiple_rarefactions_even_depth
236 ./filter_otus_from_otu_table
212 ./sort_otu_table
208 ./summarize_otu_by_cat
204 ./split_otu_table
204 ./make_distance_comparison_plots
204 ./collate_alpha
192 ./supervised_learning
180 ./make_tep
164 ./simsam
156 ./split_otu_table_by_taxonomy
156 ./plot_rank_abundance_graph
156 ./make_phylogeny
140 ./filter_taxa_from_otu_table
136 ./map_reads_to_reference
124 ./convert_otu_table_to_unifrac_sample_mapping
120 ./tree_compare
120 ./single_rarefaction
120 ./make_distance_boxplots
120 ./demultiplex_fasta
116 ./parallel_identify_chimeric_seqs
112 ./split_libraries
92  ./parallel_align_seqs_pynast
92  ./filter_distance_matrix
84  ./beta_significance
80  ./parallel_map_reads_to_reference
76  ./plot_semivariogram
76  ./per_library_stats
76  ./consensus_tree
76  ./check_id_map
68  ./parallel_blast
68  ./compare_taxa_summaries
68  ./assign_taxonomy
52  ./filter_fasta
48  ./quality_scores_plot
48  ./filter_tree
44  ./nmds
44  ./compare_distance_matrices
40  ./upgma_cluster
40  ./principal_coordinates
40  ./parallel_assign_taxonomy_rdp
40  ./parallel_assign_taxonomy_blast
40  ./neighbor_joining
36  ./make_prefs_file
36  ./convert_unifrac_sample_mapping_to_otu_table
36  ./add_alpha_to_mapping_file
32  ./transform_coordinate_matrices
28  ./truncate_fasta_qual_files
28  ./split_fasta_on_sample_ids
28  ./make_otu_table
24  ./merge_otu_maps
24  ./convert_fastaqual_fastq
24  ./compare_alpha_diversity
20  ./subsample_fasta
20  ./relatedness
20  ./extract_seqs_by_sample_id
20  ./dissimilarity_mtx_stats
16  ./truncate_reverse_primer
16  ./distance_matrix_from_mapping
16  ./count_seqs
16  ./conditional_uncovered_probability
16  ./compare_categories
16  ./cluster_quality
16  ./add_qiime_labels
12  ./poller
12  ./merge_otu_tables
12  ./merge_mapping_files
12  ./make_qiime_py_file
12  ./blast_wrapper
8   ./unweight_fasta
8   ./trflp_file_to_otu_table
8   ./load_remote_mapping_file
4   ./start_parallel_jobs_torque
4   ./start_parallel_jobs_sc
4   ./start_parallel_jobs
4   ./print_metadata_stats
4   ./identify_missing_files
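
To check how much of the total a handful of directories account for, a listing like the one above can be piped through a small summarizer. This is only a sketch: the function name `top_share` is made up here, and it assumes the input is already sorted largest first, as `sort -nr` produces.

```shell
# Hypothetical helper: reads "size<whitespace>path" lines on stdin
# (as printed by `du -k -d 1 . | sort -nr`) and prints how much of the
# total the N largest entries cover. The "." line that du prints for the
# directory itself is skipped so it is not counted twice.
top_share() {
    local n="$1"
    awk -v n="$n" '
        $2 == "." { next }
        { total += $1; if (++count <= n) top += $1 }
        END {
            if (total > 0)
                printf "%d of %d KB (%.1f%%)\n", top, total, 100 * top / total
        }
    '
}
```

For example, `du -k -d 1 . | sort -nr | top_share 5` run inside the repository reports the combined size of the five largest directories and their share of the whole.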

gregcaporaso commented 11 years ago

Having the qiime_test_data repository separate from the qiime repository is increasingly inconvenient. The issues are as follows:

1: we often have to coordinate merges in the two repositories, which is annoying and easy to forget
2: pull request reviewers have to look in two places to see changes
3: most importantly, in cases where a qiime PR requires a qiime_test_data PR to be merged in order for the script usage tests to pass, the Jenkins test of that PR will fail.

Right now the qiime_test_data repo is 261 MB. Even though this is large, I think it's worth merging into QIIME. A lot of the larger files are the ones associated with graphics that we're planning to refactor in the future, so the repo should start out large but decrease in size over time.

@ElDeveloper was the only person to raise objections last time. I think we should do this for 1.7.0, and reduce the size from there. Objections?

jairideout commented 11 years ago

:+1:

ElDeveloper commented 11 years ago

My only concern was related to the size of the repo, but the issues that come with having two separate repositories are incredibly annoying (especially for the admins). Others' opinions on the topic would be really useful.

josenavas commented 11 years ago

I can understand the problems of having a single repository, but I think they're minor issues compared with the overhead of having two repositories that have to be kept up to date at the same time. So I vote for having a single repository.

ElDeveloper commented 11 years ago

FWIW just downloaded the latest version of the test_data_repo and it's ~70 MB. If we consider what @gregcaporaso mentioned, then this size will (very likely) continue to decrease.


Also, sorry for spamming everyone's inbox, GitHub has been working funny all day.

justin212k commented 11 years ago

I think that'd reduce complexity, so I'm in favor of merging the repos.

wasade commented 11 years ago

+1


douginator2000 commented 11 years ago

+1

gregcaporaso commented 11 years ago

OK, I think we're in agreement on this so I'm going to set this as a to-do item for the 1.7.0 milestone.

gregcaporaso commented 11 years ago

Ideally we would want to keep revision history - does anyone know if that's possible?

jairideout commented 11 years ago

Looks like it's possible:

http://stackoverflow.com/questions/1425892/how-do-you-merge-two-git-repositories

gregcaporaso commented 11 years ago

Thanks for the link @jrrideout. I followed the instructions from Jakub and they worked perfectly.
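
For anyone wanting to try something similar, here is a minimal sketch of one common way to bring one repository into a subdirectory of another while keeping its commit history (a subtree-style merge). It is not necessarily the exact sequence that was followed for QIIME. The function name, the remote name `test_data`, and the arguments are placeholders. The `--allow-unrelated-histories` flag is needed with Git 2.9 and later; older Git versions do not accept it.

```shell
# Hypothetical helper: merge_repo_into_subdir <source-repo> <branch> <subdir>
# Run from inside the destination repository's working tree.
merge_repo_into_subdir() {
    local src="$1" branch="$2" subdir="$3"
    git remote add test_data "$src" &&
    git fetch test_data &&
    # Start a merge that joins both histories but leaves our files untouched.
    git merge -s ours --no-commit --allow-unrelated-histories "test_data/$branch" &&
    # Place the other repository's files under the chosen subdirectory.
    git read-tree --prefix="$subdir/" -u "test_data/$branch" &&
    git commit -m "Merge $src into $subdir/" &&
    git remote remove test_data
}
```

A call might look like `merge_repo_into_subdir ../qiime_test_data master qiime_test_data` (paths and branch name assumed). Afterwards `git log` in the destination repository should show the commits from both histories joined by the merge commit.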